Abstract

This  paper  is  a  survey  of  the  current  machine  translation  research 
in the US, Europe and Japan. A short history of machine translation
is presented first, followed by an overview of the current research
work.  Representative  examples  of  a  wide  range  of  different  approaches 
adopted  by  machine  translation  researchers  are  presented.  These  are 
described  in  detail  along  with  a  discussion  of  the  practicalities  of  scaling 
up  these  approaches  for  operational  environments.  In  support  of  this 
discussion,  issues  in,  and  techniques  for,  evaluating  machine  translation 
systems  are  addressed. 


1  Introduction 

Machine  translation  (MT),  i.e.,  translation  from  one  natural  language  into 
another  by  means  of  a  computerized  system,  has  been  a  particularly  difficult 
problem in the area of artificial intelligence (AI) for over four decades. Early
approaches  to  translation  failed  in  part  because  the  interactive  effects  of 
complex  phenomena  made  translation  appear  to  be  unmanageable.  Later 
approaches  to  the  problem  have  achieved  varying  degrees  of  success.  In 
general,  most  MT  systems  do  not  attempt  to  achieve  fully-automatic,  high- 
quality  translations,  but  instead  strive  for  a  level  of  translation  that  suits 
the  basic  needs  of  the  user,  perhaps  requiring  controlled  input  or  revisions 
(post-editing)  or  both  to  arrive  at  the  final  result. 

This  paper  is  a  survey  of  the  current  MT  research  in  the  US,  Europe 
and  Japan.  While  a  number  of  MT  surveys  have  been  written  (see,  e.g., 
[101],  [100],  [119],  [135],  [139],  [153],  [155],  [196]),  this  one  discusses  a  wide 
range  of  current  research  issues  in  light  of  results  obtained  from  a  survey  and 
evaluation  project  conducted  by  Mitre  ([112],  [27],  [26]).  During  this  project 
we  evaluated  16  systems  (10  operational  and  6  under  development)  and 
also  studied  7  U.S.  research  systems.  Because  a  number  of  innovative  MT 
approaches  have  come  about  since  the  completion  of  the  Mitre  study,  we  also 
include  some  discussion  about  more  recent  research  paradigms.  However,  we 
do not attempt to describe all MT research in detail. Rather, we present
selected systems as representative examples of the wide range of approaches
adopted by MT researchers.

The  next  section  provides  a  brief  description  of  the  history  of  MT.  In 
section  3,  we  discuss  the  types  of  challenges  (both  linguistic  and  operational) 
that one must consider in developing an MT system. Section 4 describes
three  architectural  designs  that  are  used  for  MT.  Following  this,  we  compare 
translation  systems  along  the  axis  of  research  paradigms  (section  5);  these 
include  linguistic,  non-linguistic,  and  hybrid  approaches.  We  then  discuss 
the difficult task of evaluating an MT system in section 6. Finally, we conclude
with  a  discussion  about  what  results  we  should  expect  to  see  in  the  future 
and  where  more  effort  needs  to  be  placed. 

2  The  History  of  MT 

Numerous  attempts  have  been  made  in  the  past,  both  in  the  United  States 
and  Europe,  to  automate  various  steps  in  the  translation  process.  These 
attempts  range  from  simple  on-line  bilingual  dictionaries,  terminology  data 
banks,  and  other  translation  aids  to  complete  MT  systems.  Much  work 
was  done  in  the  1950s  and  1960s  toward  achieving  MT.  However,  the  1966 
Automatic  Language  Processing  Advisory  Committee  (ALPAC)  report  [5] 
condemned those efforts, citing poor-quality technology and the availability
of inexpensive manual labor as negative-cost factors. These early efforts
failed for several reasons, not the least of which was the unreasonably high
expectation for perfect translation without having the basic theoretical
foundation to achieve this. The ALPAC report caused a major reduction in U.S.
research  and  development  (R&D)  efforts  in  the  area  of  MT  in  favor  of  some 
related  areas,  such  as  computational  linguistics  and  artificial  intelligence, 
that  subsequently  provided  a  better  theoretical  foundation  for  current  MT 
R&D. Nevertheless, reduced but still significant MT research did continue at
such places as the University of Texas/Austin, Brigham Young University,
and Georgetown University. The ALPAC report also affected the R&D effort
in Europe, but again, significant research continued in Western Europe
and  the  USSR.  An  important  side-effect  of  the  reduced  R&D  funding  for 
MT  was  the  stimulation  of  a  number  of  commercial  MT  endeavors  by  those 
displaced  from  the  research  centers.  This  resulted  in  most  of  our  current 
operational  MT  systems,  including  Systran  and  Logos. 

In the late 1960s, MT R&D was initiated in Canada, driven by the bilingual
status of the country. In the late 1970s and the 1980s, two significant
events  occurred.  The  first  was  the  formation  of  the  EUROTRA  project  by 
the  European  Communities  (EC)  to  provide  MT  of  all  the  member  nations’ 
languages. The second was the realization by both Japanese government
and industry that MT of Japanese to and from European languages first,
and later to and from other Asian languages, was important to their economic
progress. Thus far the EUROTRA project has failed to meet its
goal  of  complete  intertranslation  of  all  the  member  languages;  however,  it 
has  initiated  important  new  research  in  MT  and  computational  linguistics, 
and  augmented  existing  MT  research.  Commercial  MT  systems  supporting 
limited  language  pairs  are  now  beginning  to  emerge  from  this  effort.  The 
EUROTRA  project  continues  with  somewhat  narrowed  goals  in  that  a  large 
single  system  is  not  being  attempted.  The  Japanese  efforts  have  produced  a 
plethora  of  prototypes  and  commercially  available  operational  systems,  most 
based on established technology. Japanese research in MT, while never extensive,
has been increasing both in quality and funding. A small effort is
also  under  way  in  the  former  Soviet  Union. 

In the United States, research and commercial development have expanded
considerably since the mid-1980s. In part, this expansion has been
stimulated  by  the  desire  for  more  foreign  markets.  Government  funding 
has  increased,  and  MT  research  has  evolved  out  of  computational  linguistics 
work at such places as New Mexico State University, Carnegie Mellon University,
and the University of Maryland. Several commercial systems have been
developed,  providing  translation  capabilities  that  are  limited,  but  effective 
for some applications. Several small companies are developing and marketing
more complete MT systems based on more recent technology. The U.S.
Government, through its civil, military, and intelligence branches, is showing
increased interest in using and supporting MT systems. A market for
MT  is  developing  among  international  and  domestic  corporations  that  are 
competing  in  the  world  market. 

In  summary,  work  on  MT  has  been  under  way  for  over  four  decades,  with 
various  ups  and  downs.  The  ALPAC  report  of  the  mid-1960s  was  a  serious, 
but  by  no  means  devastating,  setback  to  the  effort,  and  the  current  trend  is 
toward  increased  support.  This  history  is  illustrated  by  Figure  1,  adapted 
from  a  chart  by  Wilks  [223]. 
Figure 2: Three Categories of Linguistic Considerations


3  Translation  Challenges 

This section discusses the types of challenges that one must consider in developing
an MT system. We examine these challenges along two dimensions, the
first  pertaining  to  different  types  of  linguistic  considerations  (e.g.,  syntactic 
word  order  and  semantic  ambiguity)  and  the  second  pertaining  to  different 
types  of  operational  considerations  (e.g.,  extensibility,  maintainability,  and 
user  interface). 

3.1  Linguistic  Considerations 

There are three main categories into which the linguistic considerations fall:
language understanding, language generation, and the mapping between language
pairs. Roughly, these are related as shown in figure 2.

Regarding  the  first  category,  there  have  been  many  arguments  in  the 
past  both  for,  and  against,  the  idea  that  a  complete  understanding  of  the 
source  text  is  necessary  for  adequate  MT  (see  [18],  [45],  [134],  [155],  [157], 
among  others).  In  more  recent  years,  however,  researchers  have  started  to 
concentrate  on  the  issue  of  whether  it  is  possible  to  achieve  a  satisfactory 
translation with a minimal amount of understanding (see, e.g., [25]). With
respect  to  this  issue,  the  areas  to  consider  are:  syntactic  ambiguity,  lexical 
ambiguity,  semantic  ambiguity,  and  contextual  ambiguity.  Each  is  addressed 
below. 

In  English,  syntactic  ambiguity  arises  in  many  contexts  including  the 
attachment  of  prepositional  phrases,  coordination,  and  noun  compounding. 
For other languages the types of syntactic ambiguities will vary. The difficulty
of prepositional phrase attachment is illustrated in the following English
sentence:

(1)  Syntactic  Ambiguity 

I  saw  the  man  on  the  hill  with  the  telescope 

Here,  we  have  no  way  to  determine  whether  the  telescope  belongs  to  the 
man  or  the  hill.  However,  in  such  cases,  particularly  for  similar  languages,  it 
may  not  be  necessary  to  resolve  such  an  ambiguity  since  a  particular  source- 
language  syntactic  ambiguity  may  transfer  to  the  target  language  and  still 
be  understandable  to  human  readers. 

In  the  case  of  lexical  ambiguity,  the  choice  between  two  possible  meanings 
of  a  source-language  lexical  item  is  often  easily  resolved  if  enough  syntactic 
context  is  available.  Consider  the  following  example:1 

(2)  Lexical  Ambiguity 

E:  book 

S: libro, reservar

The  English  word  book  would  be  translated  to  the  Spanish  noun  libro  if  it 
appeared  after  the  word  the  or  to  the  verb  reservar  if  it  appeared  before  the 
phrase  the  flight. 
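
The rule just described can be sketched in a few lines of code. This is only a toy illustration, not a description of any surveyed system: the function name `translate_book`, the determiner list, and the rule format are all assumptions introduced here.

```python
# Hypothetical toy rule set; names and the determiner list are illustrative only.
DETERMINERS = {"the", "a", "an", "this", "that"}

def translate_book(tokens, i):
    """Pick a Spanish rendering for tokens[i] == 'book' from its neighbours."""
    prev_tok = tokens[i - 1].lower() if i > 0 else None
    next_tok = tokens[i + 1].lower() if i + 1 < len(tokens) else None
    if prev_tok in DETERMINERS:
        return "libro"      # noun reading: "the book"
    if next_tok in DETERMINERS:
        return "reservar"   # verb reading: "book the flight"
    return None             # not enough syntactic context to decide
```

A real system would consult a parser or part-of-speech tagger rather than adjacent words, but the sketch shows why this kind of ambiguity is considered comparatively easy: the decision needs only local syntactic context.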

A  more  formidable  problem  is  semantic  ambiguity;  the  resolution  of  this 
type  of  ambiguity  falls  outside  of  the  realm  of  syntactic  and  lexical  knowledge 
as  in  the  following  examples: 

(3)  Semantic  Ambiguity 

(i)  Homography: 

E:  ball 

S: pelota, baile

(ii)  Polysemy: 

E:  kill 

S: matar, acabar

Many words, such as ball, have distinctly different meanings (homography);
MT  systems  are  forced  to  choose  the  correct  meaning  of  the  source  language 
constituent  in  these  cases  (e.g.,  whether  ball  corresponds  to  a  spherical  object 
(pelota) or a formal dance (baile)). Other problems arise for words like kill
which have subtle related meanings (polysemy) in different contexts (e.g.,
kill a man (matar) vs. kill a process (acabar)) and are frequently represented
by distinct words in the target language.

1 Throughout this paper, the abbreviations C, D, E, G, and S will be used to stand
for Chinese, Dutch, English, German, and Spanish, respectively. (Literal translations are
included for the non-English cases.)

Semantic  ambiguity  has  often  been  considered  an  area  in  which  it  would 
be  too  difficult  to  provide  an  adequate  translation  without  access  to  some 
form  of  “deeper”  understanding,  at  least  of  the  sentence,  if  not  the  entire 
context surrounding the sentence [18], [45]. The following well-known examples
illustrate the difficulty of semantic ambiguity:
(4) Complex Semantic Ambiguity

(i)  Homography: 

E:  The  box  was  in  the  pen. 

S:  La  caja  estaba  en  el  corral  /  *la  pluma 

‘The  box  was  in  the  pen  (enclosure)  /  *pen  (writing)’ 

(ii)  Metonymy: 

E:  While  driving,  John  swerved  and  hit  a  tree. 

S: Mientras que John estaba manejando, se desvió y golpeó con
un árbol.

‘While  John  was  driving,  (itself)  swerved  and  hit  with  a  tree’ 

In  (4i),  the  system  must  determine  that  the  pen  is  not  a  writing  implement 
but some sort of enclosed space (i.e., a play pen or a pig pen) (homography
resolution).  In  (4ii),  the  system  must  determine  that  it  is  John  who  is  driving 
but  John’s  car  that  hit  the  tree  (metonymy  resolution). 

However,  according  to  Bennett  [25],  the  sort  of  ambiguity  represented 
by  these  examples  rarely  arises  in  texts  to  which  MT  is  typically  applied; 
he argues that contextual ambiguity occurring in routinely translated texts
(e.g., computer manuals) is often easily resolved by means of a simple
feature-based approach. Consider the following examples:

(5)  Contextual  Ambiguity 

(i)  E:  The  computer  outputs  the  data;  it  is  fast 

S: La computadora imprime los datos; es rápida
‘The  computer  outputs  the  data;  (it)  is  rapid’ 

(ii)  E:  The  computer  outputs  the  data;  it  is  stored  in  ascii 

S: La computadora imprime los datos; están almacenados en ascii
‘The  computer  outputs  the  data;  (they)  are  stored  in  ascii’ 

In  the  context  of  a  computer  manual,  determining  the  appropriate  antecedent 
for  the  word  it  can  be  solved  by  distinguishing  between  storable  objects  and 
non-storable  objects  (storable  ±)  and  between  objects  with  a  speed  attribute 
and  those  without  (speed  fast/slow).  Note  that,  although  a  computer  is  a 
storable  object  in  other  contexts,  we  can  view  it  as  a  non-storable  object  in 
the  limited  domain  of  a  computer  manual. 
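
The feature-based resolution described above can be sketched as a lookup over a small domain lexicon. The feature names (`storable`, `has_speed`), the dictionary layout, and the function `resolve_it` are assumptions made for illustration; they are not taken from Bennett's system.

```python
# Hypothetical domain lexicon for a computer manual (illustrative features only).
NOUN_FEATURES = {
    "computer": {"storable": False, "has_speed": True},
    "data":     {"storable": True,  "has_speed": False},
}

# Feature that each predicate requires of its subject (also an assumption).
PREDICATE_REQUIRES = {
    "fast":   "has_speed",
    "stored": "storable",
}

def resolve_it(candidates, predicate):
    """Return the candidate antecedents whose features satisfy the predicate."""
    required = PREDICATE_REQUIRES[predicate]
    return [noun for noun in candidates if NOUN_FEATURES[noun][required]]
```

Under these assumed features, `resolve_it(["computer", "data"], "fast")` keeps only the computer and `resolve_it(["computer", "data"], "stored")` keeps only the data, matching examples (5i) and (5ii). The approach works only because the domain restriction lets the lexicon mark a computer as non-storable.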

More  difficult  ambiguities  arise  in  translations  that  are  truly  ambiguous 
without extensive contextual cues, i.e., those that require discourse or pragmatic
knowledge for correct interpretation. An effective discourse analysis
would  recognize  themes  and  theme  shifts  in  the  text  surrounding  a  sentence. 
As  a  simple  example,  consider  the  ambiguity  in  the  following  sentence: 

(6)  Complex  Contextual  Ambiguity 

E:  John  hit  the  dog  with  a  stick 

S: John golpeó el perro con el palo / que tenía el palo
‘John  hit  the  dog  with  the  stick  /  that  had  the  stick’ 


This  ambiguity  could  be  resolved  by  remembering  from  the  earlier  text  that 
John  was  carrying  a  stick  to  protect  himself  (and  not  that  there  were  several 
dogs,  one  of  which  had  a  stick).  Pragmatic  analysis  deals  with  the  intentions 
of  the  author  in  affecting  the  audience.  This  is  as  important  for  language 
generation  (to  be  discussed  next)  as  it  is  for  language  understanding.  In 
particular,  the  author’s  intentions  affect  the  choice  of  words  and  how  they 
are realized (e.g., the use of active rather than passive voice to emphasize urgency).
Together, discourse knowledge and pragmatic knowledge are useful
in  resolving  many  types  of  ambiguities. 

A second type of linguistic problem for MT is that of language generation.
Most MT researchers are of the opinion that generation of the
target-language sentence does not require a full language generation capability,
i.e., it is not necessary to fully plan the content and organization
of the text; this is because the source-language text provides much of the
information that will appear on the surface in the target language. However,
generation for MT is a non-trivial exercise since it is often difficult to
choose  the  words  that  adequately  convey  the  conceptual  knowledge  behind 
the  source-language  sentence.  This  is  the  lexical  selection  problem. 

Some  simple  examples  for  Spanish  and  English  are  given  here: 

(7)  Lexical  Selection 

(i)  S:  esperar 

E:  wait,  hope 

(ii)  G:  konnen 

E:  know,  understand 

(iii)  E:  be 

S:  ser,  estar 

(iv)  E:  fish 

S:  pez,  pescado 

Assume, for the sake of the current discussion, that an MT generator must
select the appropriate target-language words from general notions such as
expect, have knowledge of, be, and fish, respectively. In the above example,
additional information is required for choosing the relevant term from
each target-language pair. A possible scheme would be to use distinguishing
features, e.g., ±desire, ±fact, ±permanent, and ±edible, respectively.
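
One way to picture such a scheme is a table keyed by the general notion and its distinguishing feature. The sketch below is a hypothetical illustration: the table layout and the function `select_word` are assumptions, the assignment of feature values to words follows conventional usage rather than any surveyed system, and only three of the four pairs in (7) are encoded.

```python
# Hypothetical selection table: (general notion, feature) -> {feature value: target word}.
SELECTION = {
    ("expect", "desire"): {True: "hope",    False: "wait"},   # S: esperar -> E
    ("be", "permanent"):  {True: "ser",     False: "estar"},  # E: be -> S
    ("fish", "edible"):   {True: "pescado", False: "pez"},    # E: fish -> S
}

def select_word(notion, feature, value):
    """Choose the target-language word for a notion given one binary feature."""
    return SELECTION[(notion, feature)][value]
```

For instance, `select_word("fish", "edible", True)` yields pescado, while a value of `False` yields pez. The hard part in practice is not the lookup but determining the feature value from the source text.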

Further problems arise for MT generation in cases where linguistic information
required in the target language is not explicit in the source language
sentence.  Consider  the  following  example: 

(8)  Tense  Generation 

C: Wo bei Hangzhou de fengjing xiyinzhu le

E:  I  was  captivated  by  the  scenery  of  Hangchow 

E:  I  am  captivated  by  the  scenery  of  Hangchow 

In  this  example,  two  different  English  sentences  might  be  generated  from 
the  Chinese.  This  is  because  tense  information  (past,  present,  future)  is 
not overt in the source-language text. The information used to select the
target-language  tense  depends  entirely  on  the  context  of  the  utterance.  For 
example,  the  second  sentence  would  be  generated  if  the  speaker  is  looking 
at the scenery at the time of speech.2

The  generation  of  tense  is  problematic  in  other  languages  as  well.  In 
Spanish  there  is  a  distinction  made  between  simple  past  (preterit)  and  the 
ongoing  past  (imperfect).  This  type  of  distinction  is  not  made  explicitly  in 
English.  Consider  the  following  example: 

(9)  Tense  Generation 

(i)  E:  Mary  went  to  Mexico.  During  her  stay  she  learned  Spanish. 

S: Mary iba a México. Durante su visita, aprendió español.

(ii) E: Mary went to Mexico. When she returned she started to speak Spanish.

S: Mary fue a México. Cuando regresó, comenzó a hablar español.

In  the  first  example,  went  is  translated  as  the  Spanish  imperfect  past  since 
the  sentence  that  follows  is  an  elaboration,  making  went  stative.  In  the 
second  example,  went  is  translated  as  a  preterit  past  since  the  sentence 
that  follows  does  not  elaborate  the  visit  to  Mexico.  (For  a  discussion  about 
analogous  examples  in  French,  see  [76].) 

As we will see in section 3.2, the problems of understanding and generation
in MT are often addressed by restricting the domain of the text so that
the  lexicon  and  grammar  are  constrained. 

A third type of linguistic problem for MT concerns the mappings between
source- and target-language representations. There are a number of
dimensions along which source- and target-language representations may
vary. These divergences make the straightforward mapping between languages
impractical. Some examples of divergence types that MT researchers
strive to address are thematic, head-switching, structural, categorial, and
conflational.3 Each of these will be discussed, in turn.

Thematic divergence involves a “swap” of the subject and object positions:

(10)  Thematic  divergence 

E:  I  like  Mary 

S: Mary me gusta

‘Mary  (to)  me  pleases’ 

Here,  Mary  appears  in  object  position  in  English  and  in  subject  position  in 
Spanish; analogously, the subject I appears as the object me.

Head-switching divergences occur commonly across language pairs. In
such  cases,  a  main  verb  in  the  source  language  is  subordinated  in  the  target 
language: 

2 This  example  is  based  on  personal  communication  with  Qu  Yan 

3 Many sentences may fit into these divergence classes, not just the ones listed here.
Also,  a  single  sentence  may  exhibit  any  or  all  of  these  divergences. 
(11) Head-switching divergence

E:  I  like  to  eat 
G:  Ich  esse  gern 
‘I  eat  likingly’ 

Observe  that  the  word  like  is  realized  as  a  main  verb  in  English  but  as  an 
adverbial modifier (gern) in German.

In structural divergence, a verbal argument has a different syntactic realization
in the target language:

(12)  Structural  divergence 

E:  John  entered  the  house 
S: Juan entró en la casa

‘John  entered  in  the  house’ 

In this example, the verbal object is realized as a noun phrase (the house)
in  English  and  as  a  prepositional  phrase  (en  la  casa)  in  Spanish. 

Categorial divergence involves the selection of a target-language word
that  is  a  categorial  variant  of  the  source-language  equivalent.  In  such  cases, 
the  main  verb  often  changes  as  well: 

(13)  Categorial  divergence 

E:  I  am  hungry 
G:  Ich  habe  Hunger 
‘I  have  hunger’ 

In this example, the predicate is adjectival (hungry) in English but nominal
(Hunger)  in  German.  Note  that  this  change  in  category  forces  the  generator 
to  select  a  different  main  verb. 

Conflation  is  the  incorporation  of  necessary  participants  (or  arguments) 
of  a  given  action.  A  conflational  divergence  arises  when  there  is  a  difference 
in  incorporation  properties  between  the  two  languages: 

(14)  Conflational  divergence 

E:  I  stabbed  John 
S: Yo le di puñaladas a Juan
‘I gave knife-wounds to John’

This  example  illustrates  the  conflation  of  a  constituent  in  English  that  must 
be overtly realized in Spanish: the effect of the action (knife-wounds) is
indicated by the word puñaladas whereas this information is incorporated
into  the  main  verb  in  the  source  language. 

Resolution of cross-language divergences is an area where the differences
in MT architecture are most crucial. Many MT approaches resolve
such divergences by means of construction-specific rules that map from the
predicate-argument structure of one language into that of another. More recent
approaches use an intermediate, language-independent representation
to describe the underlying meaning of the source language prior to generating
the target language. The details of these contrasting approaches will be
discussed  further  in  section  4. 
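
To make the first of these approaches concrete, the following sketch shows what a construction-specific rule for the thematic divergence in (10) might look like. It is a deliberately minimal illustration: the `PredArg` structure, the `TRANSFER_RULES` table, and the `transfer` function are hypothetical names, and the arguments themselves are left untranslated.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PredArg:
    """A bare predicate-argument structure (illustrative only)."""
    predicate: str
    subject: str
    object: str

# Hypothetical transfer lexicon: source predicate -> (target predicate, swap arguments?)
TRANSFER_RULES = {
    "like": ("gustar", True),   # thematic divergence: subject and object swap
    "see":  ("ver", False),     # no divergence: argument positions are kept
}

def transfer(src):
    """Map an English predicate-argument structure onto a Spanish one."""
    target_pred, swap = TRANSFER_RULES[src.predicate]
    if swap:
        return PredArg(target_pred, src.object, src.subject)
    return PredArg(target_pred, src.subject, src.object)
```

Here `transfer(PredArg("like", "I", "Mary"))` produces a structure headed by gustar with Mary as subject. Each divergent construction needs its own rule for each language pair, which is the cost that an intermediate, language-independent representation is meant to avoid.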
3.2 Operational Considerations

In  addition  to  the  above  linguistic  challenges,  there  are  several  operational 
challenges. These include: extension of the MT system to handle new domains
and languages; handling a wide range of text styles; maintenance of
a  system  once  it  has  been  developed;  integration  with  other  user  software; 
and  evaluation  metrics  for  testing  the  effectiveness  of  the  system. 

Typically, to handle the linguistic challenges associated with understanding
or generating a text, the text is restricted by domain so that the lexicon
and grammar are constrained. By doing so, the problems of lexical
ambiguity, homography, polysemy, metonymy, contextual ambiguity, lexical
selection, and tense generation are reduced. Then when building or extending
an MT system to handle a particular domain and language, the designer
must take on the smaller but still expensive task of acquiring and adapting
the lexicon. To give an idea of the size of a domain lexicon, we have seen
domain lexicons in commercial MT systems ranging from around 8000 to
12000 entries. The lexicon size varies according to the domain and whether
an  entry  represents  multiple  senses  or  a  single  sense. 

Although several researchers have developed tools to help with the acquisition
of the lexicon (see, e.g., [28], [31], [32], [34], [36], [43], [54], [69],
[74], [81], [89], [132], [145], [165], [168], [171], [170], [210], [217], [225], [229]),
these tools reduce the overall amount of work required by only a
small amount. The majority of the work still requires manual entry and fine-tuning
by people with specialized expertise in linguistics and in the domain
for  which  the  system  is  being  built.  The  words  that  should  be  included  in 
the  system  can  be  extracted  from  a  representative  corpus  and  their  possible 
parts  of  speech  assigned.  However,  each  of  these  entries  must  be  reviewed 
to  correct  the  part  of  speech  assignments  since  the  automated  process  is  not 
100%  accurate.  In  addition,  the  entries  must  be  manually  modified  so  that 
other  linguistic  features  can  be  added. 

While some argue that one would want to manually review and fine-tune
each entry anyway [142], the expense involved depends on the system
architecture and research paradigm involved (e.g., statistical-based MT systems
do not require detailed linguistic and domain knowledge). For systems
that  require  large  amounts  of  encoded  knowledge,  research  is  in  progress  to 
automatically  extract  other  linguistic  features  from  published  bilingual  and 
monolingual  dictionaries  and  from  parallel  corpora  ([31],  [37],  [68],  [79],  [88], 
[149]). 

Another issue is how to cost-effectively maintain a lexicon once it has been
acquired.  Most  interfaces  that  have  been  built  for  users  with  no  specialized 
linguistics  training  still  look  much  like  the  first  such  interface  created  for  the 
TEAM  project  ([93]).  (Other  lexical  interfaces  are  described  in  [17],  [20], 
[71],  [125],  [90],  [94],  [92],  [207],  [206].)  The  maintainer  is  presented  with 
various  sentences  utilizing  the  word  being  updated  and  asked  to  indicate 
which  usages  are  correct  and  which  are  not.  Each  sentence  represents  a  test 
to  determine  whether  or  not  a  particular  linguistic  feature  applies. 

One problem with these interfaces is that asking these types of questions
does not work for all words. Someone with linguistic expertise will
still have to review the results of the maintenance session. For example,
once the Spanish verb gustar is entered into the lexicon as a psyche verb,
someone with knowledge about linguistic structure must check that the argument
structure (i.e., ordering of subject with respect to the object) is the
reverse of the argument structure for the analogous English word like. (Example
(10) given earlier illustrates this thematic divergence between Spanish
and English.)

The  extension  of  a  system  to  handle  additional  languages  also  involves 
providing  an  analysis  grammar  for  the  source  language  and  a  generation 
grammar for the target language. Creating these grammars requires specialized
linguistics knowledge as well as an understanding of the domain
for which the system is being built since the grammar writer usually must
understand the text in order to write a grammar that will produce an appropriate
analysis. Grammar writing is the point at which many of the linguistic
challenges  associated  with  understanding  and  generation  must  be  addressed. 
Heuristics  relevant  to  the  particular  domain  are  often  utilized  at  this  point. 
For  example,  in  Eurotra  preferences  for  PP  attachment  are  expressed  with 
heuristics  such  as:  “a  PP  which  is  not  a  modifier  is  preferred  over  the  same 
PP  when  it  is  a  modifier”  [23]. 
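
A preference heuristic of this kind can be read as an ordering over competing analyses. The sketch below is an assumption-laden illustration, not Eurotra's actual mechanism: the dictionary representation of an analysis, the `pp_is_modifier` flag, and the function name are all invented here.

```python
def prefer_non_modifier(analyses):
    """Order candidate analyses so that non-modifier PP readings come first.

    Each analysis is a dict with a boolean 'pp_is_modifier' flag (a hypothetical
    representation). Python's sort is stable and False orders before True, so
    analyses with equal preference keep their original relative order.
    """
    return sorted(analyses, key=lambda a: a["pp_is_modifier"])
```

Given two analyses of the same sentence, one attaching the PP as a modifier and one attaching it as a non-modifier, the function ranks the latter first. Because the rule only expresses a preference, the dispreferred analysis remains available if later processing rejects the first.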

As  the  grammar  is  being  written  it  must  be  continually  tested  and  refined 
in  order  to  arrive  at  a  reasonably  good  result  for  most  of  the  expected  inputs. 
Herein lie two major challenges: determining what a reasonably good result
is  and  predicting  the  most  likely  inputs.  Since  grammar  writing  requires  a 
great  deal  of  linguistic  expertise,  even  a  small  adjustment  to  the  grammar  is  a 
development  issue  and  not  a  user  maintenance  issue.  Even  as  a  development 
problem, this is one of the more time-consuming tasks and one for which not
many  tools  have  been  created. 

Another  operational  consideration  is  the  type  of  text  to  be  translated. 
Different styles and sources of published text present vastly
different degrees of operational difficulty for MT systems. Literary texts,
such  as  novels  and  poetry,  make  frequent  use  of  metaphor,  have  complex 
and  unusual  sentence  structure,  and  assume  a  wide  world  and  social  context; 
these  are  all  outside  the  competence  of  current  MT  systems.  This  is  also 
true  of  popular  journalistic  texts,  which,  in  addition,  use  (or  create)  the  most 
fashionable  words  and  social  context.  The  problem  is  exacerbated  by  the  fact 
that  authors  of  these  texts  assume  their  audiences  are  knowledgeable  about 
the general world and in some cases about the technical field underlying
their  writings.  Often,  the  text  cannot  be  understood  without  this  type  of 
knowledge, referred to as world knowledge.

MT systems fail for texts that rely heavily on metaphor and world knowledge
because they have great difficulty in representing and using complex
and subtle metaphors or understanding social context and interactions, and
it is nearly impossible for them to keep up with the rapid changes in vocabulary.
MT systems work best for texts that are written using simple syntax,
make little or predictable use of metaphor, and have a stable vocabulary
and a limited domain. Scientific and technical documents fall into this category
and thus far have represented the most successful applications of MT.
Text  fragments  such  as  tables  of  contents  and  sentence  fragments  present  a 
different  situation  in  that  the  syntactic  rules  must  be  relaxed  to  deal  with 
incomplete  sentences  and  possibly  ungrammatical  phrases.  Since  there  may 
be  little  context  as  a  basis  for  translating  the  fragments,  lexical  selection 
becomes  an  important  and  difficult  problem. 

Another  operational  consideration  is  the  necessity  for  the  MT  system  to 
be  designed  in  such  a  way  that  it  can  be  effectively  integrated  with  other 
user  software  such  as  OCR  and  document  publishing  tools.  An  application 
such  as  OCR  might  utilize  some  of  the  linguistic  information  that  is  available 
to  the  MT  system;  thus,  this  information  should  be  handled  in  such  a  way 
that  it  is  easily  retrievable  and  usable  independently  from  the  MT  system. 
Research  MT  systems  tend  to  be  modular  and  this  operational  consideration 
provides  further  motivation  and  challenges  in  designing  the  system. 

A  final  operational  consideration  is  how  to  evaluate  and  test  the  MT 
system.  This  applies  to  both  users  and  developers  of  systems.  When  a 
system  is  being  extended  or  when  a  purchase  is  being  considered,  there 
must  be  a  way  to  test  the  effectiveness  of  a  system  in  meeting  the  user’s 
requirements.  Further,  when  building  research  systems,  one  needs  to  be  able 
to evaluate the effectiveness of the approach. As mentioned in the earlier
discussion on grammar writing, predicting the inputs is a challenge.
In  the  case  of  evaluation,  the  question  is  whether  the  testing  adequately 
covers  the  possible  inputs  when  it  is  not  clear  what  all  the  inputs  will  be. 
A  second  difficulty  is  determining  the  correctness  of  a  translation.  The 
correctness  depends  on  the  intended  usage  of  the  translation.  Along  with 
this, correctness is not a single binary judgement but a set of judgements which
may or may not be binary. An important issue in evaluation is that of
choosing the appropriate judgement for a particular use of translation. We
elaborate on these evaluation issues and the research in this area later in
section  6. 

4  Architectures 

Current  architectures  for  MT  may  be  roughly  organized  into  the  following 
three  classes:  (1)  Direct;  (2)  Transfer;  and  (3)  Interlingua.  These  levels  have 
been  characterized  in  terms  of  a  ‘pyramid’  diagram  (see  figure  3)  which  first 
appeared  (in  a  slightly  different  form)  in  [216]  and  has  since  become  classic. 
The  three  levels  correspond  to  different  levels  of  transfer,  depending  on  the 
depth  of  analysis  provided  by  the  system.  At  the  bottom  of  the  pyramid  is 
the  direct  approach  which  consists  of  the  most  primitive  form  of  transfer, 
i.e.,  word-for-word  replacement.  At  the  top  of  the  pyramid  is  the  interlingual 
approach  which  consists  of  the  most  degenerate  form  of  transfer,  i.e.,  the 
transfer  mapping  is  essentially  non-existent.  Most  translation  systems  fall 
somewhere  between  these  two  extremes  ranging  from  a  shallow  (syntactic) 
analysis  to  a  deeper  (semantic)  analysis.  We  will  discuss,  in  turn,  the  MT 
architectures corresponding to these three levels.

4.1  Direct  Architecture 

The  result  of  a  direct  translation  architecture  is  a  string  of  target-language 
words  directly  replaced  from  the  words  of  the  source  language.  Generally 
the  word  order  of  the  target-language  text  is  the  same  as  that  of  the  source- 
language,  even  in  cases  where  the  target-language  does  not  permit  the  same 
word  order.  Unless  the  reader  has  a  good  knowledge  of  the  source-language 
structure,  this  text  can  be  very  difficult  to  understand. 

Some  systems  based  on  the  direct  architecture  recognize  special  source- 
language  syntactic  forms  and  reorder  the  words  to  acceptable  forms  in  the 
target  language.  These  syntactic  improvements  increase  the  readability  of 
the  target  text.  For  example,  a  direct  approach  might  handle  the  thematic 
divergence  of  example  (10)  given  earlier  by  means  of  a  rule  such  as  the 
following: 

(15)  X LIKE Y → Y GUSTAR X

As  we  will  see  below,  such  rules  are  closer  in  nature  to  those  used  in  the 
transfer  approach.  Unfortunately,  without  a  detailed  syntactic  analysis,  only 
simple  forms  can  be  recognized;  consequently,  complex  structures,  such  as 
clauses  and  verb  separations  (as  are  frequently  found  in  German),  are  left 
in the original syntax. Moreover, when more difficult cases arise, e.g., example
(11) above, it is impossible to construct direct mapping rules. The
result is that this approach typically generates very literal translations, e.g.,
I eat likingly for the German sentence Ich esse gern.
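
The behavior described above can be illustrated with a minimal sketch, assuming a toy lexicon and a single surface pattern in the spirit of rule (15). The lexicon entries, the function names, and the restriction to three-token inputs are hypothetical and do not describe any particular system.

```python
# Toy direct-architecture translator: word-for-word replacement plus one
# surface reordering rule. Lexicon and rule format are hypothetical.
LEXICON = {
    "i": "yo",
    "like": "gustar",
    "mary": "Maria",
    "eat": "comer",
    "apples": "manzanas",
}


def reorder_like(tokens):
    """Apply X LIKE Y -> Y LIKE X, but only to exactly three tokens."""
    if len(tokens) == 3 and tokens[1] == "like":
        x, verb, y = tokens
        return [y, verb, x]
    return tokens


def direct_translate(sentence):
    """Reorder with the single surface rule, then replace word for word."""
    tokens = reorder_like(sentence.lower().split())
    # Words missing from the lexicon are passed through unchanged.
    return " ".join(LEXICON.get(token, token) for token in tokens)


simple = direct_translate("I like Mary")
longer = direct_translate("I like green apples")
```

Because the rule matches only a flat three-word pattern, the longer input is left in source word order, and no inflection is produced in either case. This is the limitation the text attributes to the absence of a detailed syntactic analysis.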

A  more  serious  problem  with  systems  based  on  the  direct  architecture 
(as  well  as  with  some  versions  of  transfer  architecture  systems)  is  selection  of 
the correct target-language words for source-language words (lexical ambiguity).
Recall from section 3.1 that many words, such as ball, have distinctly
different meanings (homography) and translations; others, such as kill, have
subtle related meanings (polysemy) that are frequently represented by distinct
words in the target language. Direct architecture systems cannot cope
with  this  lexical  selection  problem  since  they  cannot  relate  a  word  to  the  way 
it  is  used  in  a  sentence.  The  best  that  can  be  done  is  to  restrict  the  textual 
domain  and  include  in  the  lexicon  only  the  translation  most  likely  for  that 
domain.  Direct  architecture  systems  produce,  at  best,  poor  translations. 
However,  for  limited  domains  and  simple  text  (such  as  tables  of  contents  or 
text  fragments  where  correct  syntax  is  less  critical),  they  sometimes  produce 
translations  useful  to  domain  experts. 

4.2  Transfer  Architectures 

As shown in figure 3, transfer architectures lie on a spectrum ranging from
direct  to  interlingual  architectures:  at  the  direct  architecture  end  of  the 
spectrum  is  the  syntactic  transfer  architecture;  at  the  interlingual  end  of 
the  spectrum  is  the  semantic  transfer  architecture.  The  initial  intent  of 
transfer  architecture  systems  was  to  provide  syntactically  correct  target- 
language  text  by  transforming  source-language  representations  into  suitable 
target-language  syntactic  representations.  Although  the  transfer  rules  that 
perform  this  conversion  depend  on  both  the  source  and  target  languages, 
some of the rules may need only slight modification when an MT system is
developed  for  a  new  target  language  linguistically  related  to  an  existing  one. 

Both  the  transfer  and  the  interlingual  approaches  require  “linking  rules” 
that  map  between  the  surface  (source-  and  target-language)  text  and  some 
form  of  internal  representation.  What  distinguishes  these  two  approaches  is 
that  the  internal  representations  used  in  the  transfer  approach  are  assumed 
to  vary  widely  from  language  to  language.  Thus,  transfer  rules  must  be 
constructed  to  map  between  these  two  representations.  As  we  will  see  in 
the  next  section,  no  transfer  rules  are  needed  in  the  interlingual  approach 
because  the  same  internal  representation  is  used  for  both  the  source  and 
target  languages. 

Although the internal representations used for the source and target languages
are not the same, the primitives (e.g., SUBJ, OBJ1) used in these
representations are often similar, or even identical. The use of similar primitives
allows for more general mapping rules than those of the direct approach.
For  example,  the  rule  for  mapping  between  the  sentences  in  example  (10) 
would  be  more  general  than  the  analogous  rule  in  (15): 

(16)  gustar(SUBJ(ARG1:NP), OBJ1(ARG2:PREP)) →
      like(SUBJ(ARG2:NP), OBJ1(ARG1:NP))

The  effect  of  this  rule  is  to  swap  the  subject  and  object  arguments  and  to 
change  the  category  of  the  object  from  a  preposition  (in  Spanish)  to  a  noun 
phrase  (in  English). 
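
As a rough illustration of the effect just described, the following sketch applies the argument swap and category change to a small dictionary-based structure. The dictionary layout, the function name, and the example heads are assumptions made for illustration; lexical transfer of the argument heads themselves is deliberately left out.

```python
# Sketch of the structural transfer described for rule (16).
# The dict layout and all names are illustrative assumptions.
def transfer_gustar_to_like(src):
    """Map a Spanish 'gustar' structure onto an English 'like' structure."""
    if src["pred"] != "gustar":
        raise ValueError("rule applies only to the predicate 'gustar'")
    arg1 = src["SUBJ"]  # ARG1: Spanish subject (NP)
    arg2 = src["OBJ1"]  # ARG2: Spanish object (prepositional)
    return {
        "pred": "like",
        # ARG2 becomes the English subject and is recategorized as an NP.
        "SUBJ": {"arg": "ARG2", "cat": "NP", "head": arg2["head"]},
        # ARG1 becomes the English object and stays an NP.
        "OBJ1": {"arg": "ARG1", "cat": "NP", "head": arg1["head"]},
    }


spanish = {
    "pred": "gustar",
    "SUBJ": {"arg": "ARG1", "cat": "NP", "head": "Maria"},
    "OBJ1": {"arg": "ARG2", "cat": "PREP", "head": "yo"},
}
english = transfer_gustar_to_like(spanish)
```

The rule refers only to grammatical functions and categories, not to particular words in particular positions, which is what makes it more general than the surface pattern in (15).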

Unlike  the  direct  approach,  the  transfer  architecture  accommodates  more 
complex  mappings  such  as  that  of  example  (11).  (We  will  discuss  specific 
approaches  to  handling  such  cases  in  section  5.1.4.)  However,  a  common  crit¬ 
icism  of  this  approach  is  that  a  large  set  of  transfer  rules  must  be  constructed 
for each source-language/target-language pair; a translation system that
accommodates n languages requires n^2 sets of transfer rules. This shortcoming
has  been  noted  by  a  number  of  researchers  (see,  e.g.,  [24],  [65]). 
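
The growth in the number of rule sets can be made concrete with a short calculation. The sketch below assumes one rule set per ordered pair of distinct languages, which gives n(n-1), the same order of growth as the n^2 figure quoted above. For comparison it also counts one analysis and one generation component per language, as in the interlingual design of section 4.3. Both counting conventions are assumptions made for illustration.

```python
def transfer_rule_sets(n):
    """One rule set per ordered pair of distinct languages."""
    return n * (n - 1)


def interlingual_components(n):
    """One analysis component and one generation component per language."""
    return 2 * n


# Tabulate both counts for a few system sizes.
table = [(n, transfer_rule_sets(n), interlingual_components(n)) for n in (2, 5, 10)]
for n, pairs, components in table:
    print(n, pairs, components)
```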

Despite  the  drawbacks  associated  with  the  use  of  transfer  rules,  the  syn¬ 
tactic  transfer  architecture  has  the  advantage  that  ambiguities  that  carry 
over  from  one  language  to  another  are  handled  with  minimal  effort.  Consider 
the  example  given  earlier:  John  hit  the  dog  with  a  stick.  In  this  example,  the 
syntactic  ambiguity  is  not  resolved  by  the  syntactic  analysis  because  there  is 
no  way  to  determine  from  syntax  alone  whether  the  dog  had  a  stick  or  John 
used  a  stick  to  hit  the  dog.  We  have  already  seen  that,  if  we  are  translating 
between  similar  languages,  it  may  not  be  necessary  to  resolve  the  ambiguity; 
the  source-language  syntactic  ambiguity  may  transfer  to  the  target  language 
and  still  be  understandable  to  human  readers.  In  an  attempt  to  improve 
performance,  some  syntactic  transfer  architecture  systems  take  advantage 
of  this  phenomenon  and  refrain  from  doing  a  complete  syntactic  analysis  of 
these  structures. 

Transfer  approaches  are  also  able  to  resolve  certain  lexical  ambiguities 
since  the  syntactic  analysis  can  usually  determine  the  lexical  category  (part 
of  speech)  of  a  source  text  word.  For  example,  as  mentioned  earlier,  it  is 
possible  to  determine  whether  the  English  word  book  would  be  translated  in 
Spanish to the noun libro or to the verb reservar, depending on the local
context. 

The overall translation quality of syntactic transfer architecture systems
tends to be lower than that of systems that employ a deeper analysis of the
source-language text. Many lexical and syntactic ambiguities are not resolvable;
consequently, long and complex sentences may not be understandable. In
an attempt to improve translation quality by considering the meaning of the
sentences, most transfer architecture systems have moved to the semantic
transfer end of the spectrum by adding semantic analysis and semantic transfer
rules as needed (i.e., ambiguities such as the ball and with a stick cases
above would be resolved). The result of this combined syntactic and semantic
analysis is a representation of the source text that combines translation-relevant
syntactic and semantic information. Since this is usually done to
solve specific language pair problems, the semantic analysis remains incomplete
and, to some extent, language pair-dependent. That is, the addition of
a new target language may well require modification of the source-language
semantic analysis.

In principle, semantic transfer architecture systems have the capability
to  produce  excellent  translations,  provided  that  a  context  (discourse  and 
pragmatic)  analysis  is  done  in  addition  to  a  deep  semantic  analysis.  In  prac¬ 
tice,  little  or  no  discourse  or  pragmatic  analysis  is  done,  and  only  enough 
semantic  analysis  is  done  to  meet  the  translation  goals  of  the  system.  Se¬ 
mantic  transfer  architecture  systems  can  produce  good  translations  when  the 
analysis  and  rules  are  complete,  and  the  bilingual  lexicon  covers  the  domain 
of  interest. 

A  perceived  difficulty  with  transfer  architecture  systems  is  that  the  trans¬ 
fer  rules  and,  to  some  extent,  the  source-language  analysis  are  dependent  on 
both  the  source  and  target  language.  Thus  a  new  system  would  have  to 
be  developed  for  each  language  pair  of  interest.  This  is  not  as  problematic 
as  might  be  expected.  First,  target-language  generation  can  be  expected 
to  need  little  augmentation  when  a  new  source  language  is  added.  Second, 
much  of  the  source-language  analysis  will  not  change  as  new  target  languages 
are  added;  only  newly  discovered  semantic  and  structural  differences  need 
be  resolved.  Finally,  it  is  true  that  new  transfer  rules  will  be  required.  How¬ 
ever,  the  addition  of  a  new  source  or  target  language  will  affect  only  the 
recognition  or  production  parts  of  the  rules,  respectively;  if  the  language  is 
being  replaced  by  one  similar  in  structure,  many  of  the  transfer  rules  need 
not  be  changed.  Of  course,  the  addition  of  radically  different  languages 
(e.g., the first Asian language added to a system working between European
languages)  will  require  a  major  effort. 

At  the  semantic-transfer  end  of  the  spectrum  there  is  a  final  category 
of  transfer  architecture  that  could  be  viewed  as  a  “special-case  interlingual” 
design,  i.e.,  one  that  defines  a  single  syntactic  and  semantic  representation 
for  several  related  languages,  such  as  the  Romance  languages.  This  approach 
is  termed  “multilingual”.  In  figure  3,  the  multilingual  representation  takes 
the  place  of  two  semantic  structure  nodes;  no  transfer  rules  are  necessary,  yet 
the  representation  is  not  interlingual  since,  as  in  standard  transfer  systems, 
it  relies  on  the  characteristics  of  the  source  and  target  languages.  In  this 
approach  the  analysis  and  generation  processes  depend  only  on  the  respective 
source  and  target  languages.  In  practice,  this  approach  is  being  exploited 
by  a  number  of  systems. 

To summarize, transfer architecture systems produce higher-quality results
than direct architecture systems, but at the expense of having to develop
extensive source-language analysis techniques and sets of transfer rules.

4.3  Interlingual  Architectures 

The basic idea of the interlingual (sometimes called pivot) architecture for
MT  is  that  the  analysis  of  the  source-language  text  should  result  in  a  rep¬ 
resentation  of  the  text  that  is  independent  of  the  source  language.  The 
target-language  text  is  then  generated  from  this  language-neutral,  interlin¬ 
gual  representation.  This  model  has  the  significant  advantage  that  analysis 
and  generation  development  need  be  done  only  once  for  each  language,  and  a 
translation  system  can  be  constructed  by  joining  the  analysis  and  generation 
through the interlingual representation.

This  is  currently  a  very  active  area  of  research,  although  a  few  commercial 
systems  are  based  on  this  approach  [16],  [53],  [80],  [160],  [159].  The  research 
issues  center  on  the  feasibility  of  specifying  an  interlingua  that  is  adequate 
for  all  languages  and  on  the  depth  of  semantic  analysis  required  to  produce 
acceptable  translations.  The  latter  is  also  an  issue  for  the  more  ambitious 
systems  based  on  the  semantic  transfer  architecture. 

The interlingual approach to example (10) would be to assume that there
exists  a  single  underlying  concept  for  the  meaning  of  the  main  verb  in  both 
sentences,  i.e.,  a  representation  such  as  the  following: 

(17)  like/gustar: [CAUSE (X, [BE (Y, [PLEASED])])]4

This  representation  conveys  the  idea  that  something  or  someone  (X)  causes 
someone  (Y)  to  be  pleased.  An  approach  that  adopts  this  representation 
would  not  require  transfer  rules  since  the  representation  would  be  the  same 
for  the  source  and  target  languages.  Instead,  all  that  would  be  needed  is 
to  define  “linking  rules”  that  map  between  the  surface  (source-  and  target- 
language)  text  and  the  interlingual  form. 
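
A minimal sketch of this arrangement follows, assuming a hypothetical table of linking rules that records which surface position each verb assigns to the roles X and Y of representation (17). The table contents, the tuple encoding, and the function names are illustrative assumptions. The argument words themselves are not translated here.

```python
# Linking rules relate surface argument positions to the roles X and Y of
# the shared form CAUSE(X, BE(Y, PLEASED)). All contents are hypothetical.
LINKING = {
    "like": {"subject": "Y", "object": "X"},    # the pleased party is the subject
    "gustar": {"subject": "X", "object": "Y"},  # the cause is the subject
}


def analyze(verb, subject, obj):
    """Build the language-neutral form from a surface clause."""
    link = LINKING[verb]
    roles = {link["subject"]: subject, link["object"]: obj}
    return ("CAUSE", roles["X"], ("BE", roles["Y"], "PLEASED"))


def generate(interlingua, verb):
    """Realize the language-neutral form with a given target verb."""
    _, x, (_, y, _) = interlingua
    roles = {"X": x, "Y": y}
    link = LINKING[verb]
    return {
        "verb": verb,
        "subject": roles[link["subject"]],
        "object": roles[link["object"]],
    }


interlingua = analyze("like", "I", "Mary")
spanish_clause = generate(interlingua, "gustar")
```

No rule in the sketch mentions both languages at once: the swap of subject and object falls out of composing one language's analysis with another language's generation.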

An issue raised with respect to this approach is that, because interlingual
representations are generally independent of the syntax of the source
text, the generation of the target language text from this representation often
takes the form of a paraphrase rather than translation (see, e.g., [10],
[102], [108]). That is, the style and emphasis of the original text are lost.
However, this is not so much a failure of the interlingua as it is a lack of
understanding of the discourse and pragmatics required to recognize style
and emphasis. In some cases, it may be an advantage to ignore the author’s
style. Moreover, many have argued that, outside the field of artistic texts
(poetry and fiction), preservation of the syntactic form of the source text in
translation is completely superfluous (see, e.g., [157], [220]). For example,
the passive voice constructions in the two languages may not convey identical
meanings. The current state of the art seems to be that it is possible to produce
interlinguas that are adequate between language groups (e.g., Japanese
and western European) for specialized domains.

Another issue concerns a point raised earlier, i.e., that authors of source
texts assume their audiences are knowledgeable about the general world and
in some cases about the technical field underlying their writings. Many
researchers (e.g., [155]) who adopt the interlingual approach aim to employ
a deep semantic analysis that requires extensive world knowledge; the
performance of deep semantic analysis (if required) depends on the (so far
unproven) feasibility of representing, collecting, and efficiently storing large
amounts of world and domain knowledge. This problem consumes extensive
efforts in the broader field of artificial intelligence.

4 This is a simplified, generic version of a representation that could be attributed to a
number  of  researchers  including  Schank  ([182],  [183],  [184])  and  Jackendoff  ([105],  [106]) 
among  others.  See  [65]  for  a  more  detailed  treatment  of  such  cases. 

5  Paradigms of MT Research Systems

The  architectural  basis  of  the  system  is  only  one  of  many  axes  along  which 
one  might  compare  MT  systems.  Another  important  axis  of  comparison 
is  that  of  research  paradigm.  It  is  important  to  understand  the  difference 
between  the  type  of  architecture  and  the  type  of  paradigm:  one  does  not 
presuppose the other. The former refers to the actual processing design (i.e.,
direct, transfer, interlingual), whereas the latter refers to informational components
that aid the processing design (knowledge-based, example-based,
statistics-based, etc.).

This  section  enumerates  and  discusses  some  of  the  more  recent  classes  of 
MT paradigms that researchers are currently investigating. This list is by
no means exhaustive. It is intended to cover most of the approaches that
have been explored in recent years, the vast majority of which were reported in
the last five years at a number of conferences including the Annual Meeting
of the Association for Computational Linguistics (ACL), the International
Conference on Theoretical and Methodological Issues in Machine Translation
(TMI), the International Conference on Computational Linguistics (COLING),
and the MT Summit (MT-Summit).

There may be some disagreement about the boundaries of the classification.
For example, the Shake-and-Bake (S&BMT) approach has been viewed as a
Constraint-Based (CBMT) approach (see, e.g., [221], [222]) in that the translation
process is taken to be a collection and resolution of sets of constraints. It has
also been viewed as a lexical-based (LBMT) approach (see, e.g., [21], [22]) in
that a bilingual lexicon is used to put into correspondence pairs of monolingual
lexical entries. Frequently, researchers employ techniques from several
categories. An example of such a case is an approach described in [91] which
proposes to combine techniques used by Example-Based (EBMT), Statistics-Based
(SBMT), and Rule-Based (RBMT) MT.

We  will  discuss  research  paradigms  in  terms  of  three  different  categories: 
(1) those that propose to rely most heavily on linguistic techniques; (2) those
that do not use any linguistic techniques; and (3) those that use a combination
of the two. The separation of linguistics-based and non-linguistics-based
approaches  illustrates  an  emerging  dichotomy  among  MT  researchers  that 
first  became  evident  at  the  TMI  in  1992.  This  is  the  confrontation  dubbed 
the  ‘rationalist-empiricist’  debate,  which  divides  researchers  into  two  groups, 
those  who  advocate  well-established  methods  of  rule-based/constraint-based 
MT (linguistic-based MT) and those involved in newer corpus-based MT (including
EBMT, SBMT, and Neural Network Based (NBMT)). Many of these
same  issues  have  continued  as  hot  topics  of  debate  during  the  TMI  in  1992. 
Several  researchers  have  now  acknowledged  the  need  for  a  hybrid  or  inte¬ 
grated  approach  to  MT  that  makes  use  of  techniques  from  both  types  of 
paradigms,  combining  the  best  that  each  paradigm  type  has  to  offer.  For 
convenience,  an  index  of  the  approaches  discussed  here  is  given  (in  alpha¬ 
betical  order  by  author)  in  appendix  A. 


5.1  Linguistic-Based  Paradigms 

Until  very  recently,  most  MT  researchers  studied  Linguistic-based  MT,  i.e., 
translation  on  the  basis  of  principles  that  are  well-grounded  in  linguistic 
theory.  Systems  based  on  linguistic  theory  strive  to  use  the  constraints  of 
syntax,  lexicon,  and  semantics  to  produce  an  appropriate  target-language 
realization  of  the  source-language  sentence.  This  section  presents  several  of 
the  most  recent  linguistic-based  paradigms. 

5.1.1  Constraint-Based  MT 

Constraint-Based (CBMT) techniques have shown up in several different MT
approaches (see, e.g., [13], [76], [115], [116], [176], [221], [222]). For example,
the Shake-and-Bake approach (discussed below) demonstrates the full utility
of constraint application.

In  this  section,  we  will  discuss  one  of  the  earliest  MT  approaches  to  use 
constraints  on  combination  of  lexical  items,  i.e.,  the  LFG-MT  system  [115], 
[116].  This  system  translates  English,  French,  and  German  bidirectionally 
based  on  lexical  functional  grammar  (LFG)  [114].  In  the  LFG  formalism, 
f-structure  (functional  structure)  is  a  fundamental  component  of  the  trans¬ 
lation.  For  example,  the  f-structure  for  the  sentence  I  gave  a  doll  to  Mary 
is: 
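
A minimal sketch of such a structure, assuming a conventional attribute-value layout, can be written as a nested mapping. The attribute names (SUBJ, OBJ, OBL, TENSE, SPEC, PCASE) and the helper function are illustrative assumptions and are not taken from the original figure.

```python
# An f-structure held as a nested attribute-value mapping (illustrative only).
f_structure = {
    "PRED": "give<SUBJ, OBJ, OBL>",
    "TENSE": "past",
    "SUBJ": {"PRED": "I"},
    "OBJ": {"PRED": "doll", "SPEC": "a"},
    "OBL": {"PCASE": "to", "PRED": "Mary"},
}


def get_path(fs, *attributes):
    """Follow a sequence of attribute names down into an f-structure."""
    for attribute in attributes:
        fs = fs[attribute]
    return fs


object_head = get_path(f_structure, "OBJ", "PRED")
```

The point of the representation is that grammatical functions, rather than word order, index the content of the sentence, so a path such as OBJ PRED picks out the same information however the sentence is phrased on the surface.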


The  LFG-MT  system  is  capable  of  handling  difficult  translation  cases 
such  as  the  following: 

(19)  Promotional  divergence: 

E: The baby just fell => F: Le bébé vient de tomber

‘The  baby  just  (verb-past)  of  fall’ 

Here,  the  English  just  is  translated  as  the  French  main  verb  venir  which 
takes  the  falling  event  as  its  complement  de  tomber.  The  f-structures  that 
correspond,  respectively,  to  the  English  and  French  sentences  in  this  example 
are  the  following: 


(20)  (i)  [f-structure of the English sentence]
      (ii) [f-structure of the French sentence]

Because the LFG-MT system is based on construction-specific representations,
the mapping operations required in the transfer must be performed
by  transfer  equations  that  relate  source-  and  target-language  f-structures. 
The  transfer  equations  that  relate  the  f-structures  (20) (i)  and  (ii)  are  the 
following: 

(21)  (τ ↑ PRED ‘JUST((↑ ARG))’) = VENIR
      (τ ↑ XCOMP) = τ(↑ ARG)

These equations identify venir as the corresponding French predicate, and they
constrain the argument of just to be a complement that is headed by the
prepositional complementizer de.
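
The effect of the transfer equations can be sketched procedurally, assuming f-structures are held as nested mappings and that a small hypothetical bilingual table supplies the remaining predicates. The function below is a simplification written for illustration, not the LFG-MT formalism itself: it does not model the subject shared between venir and its complement, and the COMPL attribute is an assumed way of recording the complementizer.

```python
# Hypothetical predicate correspondences for the ordinary (non-'just') case.
LEX = {"fall": "tomber", "baby": "bébé"}


def tau(fs):
    """Map an English f-structure to a French one (simplified sketch)."""
    pred = fs["PRED"]
    if pred == "just":
        # The French predicate is venir, and its XCOMP is the image of
        # the ARG of 'just', marked with the complementizer de.
        xcomp = dict(tau(fs["ARG"]), COMPL="de")
        return {"PRED": "venir", "XCOMP": xcomp}
    out = {"PRED": LEX[pred]}
    for attribute, value in fs.items():
        if attribute != "PRED":
            # Embedded f-structures are mapped recursively; atoms are copied.
            out[attribute] = tau(value) if isinstance(value, dict) else value
    return out


english = {"PRED": "just", "ARG": {"PRED": "fall", "SUBJ": {"PRED": "baby"}}}
french = tau(english)
```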

As  illustrated  here,  the  LFG-MT  framework  makes  an  association  be¬ 
tween  the  syntactic  structure  and  the  f-structure  using  a  set  of  mediating 
selectional  constraints  that  are  encoded  as  lexical  entries.  The  disadvantage 
to  this  approach  is  that  the  f-structure  is  tightly  coupled  with  the  syntactic 
structure  of  the  language;  thus,  if  a  particular  concept  can  be  syntactically 
expressed  in  more  than  one  way,  there  will  be  more  than  one  f-structure  in 
this  framework. 

A  more  serious  flaw  of  the  LFG-MT  system  concerns  the  handling  of 
cases  like  (19)  in  the  context  of  embedded  clauses.  (For  additional  discus¬ 
sion,  see  [177].)  In  particular,  if  the  English  sentence  in  example  (19)  were 
realized as an embedded complement such as I think that the baby just fell, it
would  not  be  possible  to  generate  the  French  output.  The  reason  for  this  is 
that  the  LFG-MT  system  breaks  this  sentence  down  into  predicate-argument 
relations  that  conform  (roughly)  to  the  following  logical  specification: 

(22)  think(I, fall(baby))
      just(fall(baby))

The  problem  is  that  the  logical  constituent  fall(baby)  is  predicated  of  two 
logical  heads,  think  and  just.  The  LFG-MT  generator  is  unable  to  determine 
how  to  compose  these  concepts  and  produce  an  output  string. 

More recently, this difficulty has been addressed in [116] where a new
description-language operator, restriction, is used to provide a more adequate
account of head-switching. A refined version of this approach is
described in [59] and currently used in the Verbmobil MT project [58].

5.1.2  Dialogue-Based  MT 

[IN  PREPARATION] 

5.1.3  Knowledge-Based  MT 

[IN  PREPARATION] 

5.1.4  Lexical-Based  MT 

Lexical-Based  MT  (LBMT)5  overlaps  heavily  with  several  other  approaches 
including  RBMT  ([9],  [15],  [12],  [10]),  PBMT  ([61],  [62],  [64]),  and  S&BMT 
([21],  [22]);  in  general,  a  lexical-based  system  refers  to  any  system  that  sup¬ 
plies  rules  for  relating  the  lexical  entries  of  one  language  to  the  lexical  en¬ 
tries  of  another  language.  Several  researchers  have  adopted  the  lexical-based 
paradigm,  but  at  different  degrees  of  generality.  (See,  e.g.,  [1],  [30],  [60],  [63], 
[65],  [75],  [78],  [79],  [80],  [81],  [179],  [209],  [211],  [225].) 

One such system is the LTAG system [1] for English-French and French-English.
The system is a transfer approach that uses synchronous tree-adjoining
grammars (as described in [192]) to map shallow tree-adjoining
grammar (TAG) [113] derivations from one language onto another. The
mapping is performed by means of a bilingual lexicon which directly associates
source and target trees through links between lexical items and their
arguments. Roughly, each bilingual entry contains a mapping between a
source-language sentence and a target-language sentence.

This  approach  handles  cases  such  as  the  following: 

(23)  Categorial  divergence: 

E: John is fond of music => F: John aime la musique

‘John  loves  the  music’ 

Here, the source language concept is realized as the adjectival form be fond
of in English, whereas the French translation realizes this concept as the
verb aimer. The transfer rule that accounts for this mapping directly links
the adjectival phrase fond of in the source-language tree with the verb aimer
in the target-language tree as shown in figure 4. This translation mapping
relates the AP node in the English tree to the V node of the French tree.

Figure 4: Mapping Source-Language Trees to Target-Language Trees in
LTAG

5 The acronym LBMT has also been used for linguistic-based MT, which is a more
general term that refers to approaches that belong to any category outside of EBMT,
NBMT, or SBMT. It is used in a more specific sense here, i.e., it refers to those
linguistic-based systems that are driven primarily by the lexicon.

The  advantage  of  this  approach  is  that  it  accommodates  modifying  phrases 
in  cases  such  as  the  following: 

(24)  Categorial  divergence: 

E: John is very fond of music => F: John aime beaucoup la musique

‘John  loves  very  much  the  music’ 

Here,  the  English  adverb  very  is  associated  with  the  predicate  fond  of  (in¬ 
stead  of  with  the  main  verb)  whereas  in  French,  the  corresponding  adverbial 
beaucoup  is  associated  with  the  main  verb  aimer.  The  mechanism  that  per¬ 
mits  this  modification  to  be  appropriately  executed  is  the  linking  between 
the  adjectival  phrase  fond  of  and  the  verb  aimer:  since  the  English  main 
verb  be  has  no  link  associated  with  it,  the  modifier  must  instead  be  associ¬ 
ated  with  the  adjectival  phrase. 
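
The role of the links can be sketched as follows, assuming a bilingual entry is stored as a pair of node inventories plus a table of node-to-node links. The entry layout, the node labels other than AP and V, the modifier table, and the function name are hypothetical; real synchronous TAG entries pair elementary trees rather than flat node lists.

```python
# A hypothetical bilingual entry: source and target node inventories and
# the links between them. Only the AP-to-V link is taken from the text.
ENTRY = {
    "source": {"anchor": "be fond of", "nodes": ["S", "VP", "AP"]},
    "target": {"anchor": "aimer", "nodes": ["S", "VP", "V"]},
    "links": {"AP": "V"},
}

# Hypothetical bilingual entries for modifiers.
MODIFIERS = {"very": "beaucoup"}


def transfer_modifier(entry, source_node, modifier):
    """Carry a modifier from a source node across to its linked target node."""
    target_node = entry["links"].get(source_node)
    if target_node is None:
        raise KeyError(f"no link for source node {source_node!r}")
    return target_node, MODIFIERS[modifier]


attachment = transfer_modifier(ENTRY, "AP", "very")
```

A modifier can be carried across only where a link exists, so in this sketch an attempt to transfer it from the unlinked VP node of the English tree fails, while the AP node maps it onto the French verb.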

One disadvantage to this approach is that it requires entire trees to
be stored in the transfer dictionary for each source-to-target pair. This
becomes a significant burden as the number of source and target languages
grows.

5.1.5  Principle-Based  MT 

[IN  PREPARATION] 

5.1.6  Rule-Based  MT 

The  Rule-Based  MT  (RBMT)  paradigm  is  associated  with  systems  that 
rely  on  different  linguistic  levels  of  rules  for  translation  between  the  source 
and  target  language  [9],  [15],  [12],  [10],  [55],  [84],  [91],  [115],  [116],  [129], 
[140], [213], [162], [174], [208]. The prototypical example of such a system
is Rosetta [174] which divides translation rules into two categories: (1) S-rules,
which are “non-meaningful rules” that map lexical items to syntactic
trees; and (2) M-rules, which are “meaning-preserving rules” that map between
syntactic trees and underlying meaning structures.

Rosetta’s separation of meaning-preserving and non-meaningful rules is
reminiscent of the notion of relaxed compositionality in transfer systems
such as Eurotra and its descendant, MiMo [15], [12], where “regular” rules
are used for compositional phenomena and “exceptional” rules are used for
non-compositional phenomena. (Not surprisingly, both groups address the same
phenomena, e.g., head-switching operations required to handle cases such as
(11) above.) In fact, the input-output behaviors of both designs are
entirely equivalent, but Rosetta relies on an interlingual representation
for characterizing the meaning of the input expression so that it can later
be interactively disambiguated.

Consider  the  case  of  categorial  divergence  in  which  two  phrasal  heads  are 
swapped: 

E:  Mary  happened  to  come 
D:  Mary  kwam  toevallig 
‘Mary  came  by  chance’ 

The interlingual representation used for this case in Rosetta is a
canonical form corresponding to ‘by-chance(Mary, come).’ Thus, the English
syntactic structure parallels the canonical form: the verb construct
corresponding to ‘by-chance’ (happen to) takes as its argument the clause
corresponding to ‘(Mary, come)’ (came). The Dutch syntax, on the other hand,
is not in sync with the canonical form since the verb corresponding to
‘(Mary, come)’ (kwam) takes as its argument the adverbial construct
corresponding to ‘by-chance’ (toevallig or ‘by chance’).

In  order  to  handle  such  cases  compositionally,  a  “switch  rule”  is  invoked 
and  normal  processing  is  interrupted;  control  is  then  passed  to  a  module 
that  derives  a  new  category  (from  an  argument  of  the  canonical  head)  that 
takes  over  the  role  of  syntactic  head.  One  problem  with  this  approach  is  that 
it leaves open the question of how grammar-driven interrupts interact with
idiosyncratic  requirements  of  individual  lexical  items.  In  the  case  above,  the 
non-head  constituent  toevallig  could  be  viewed  as  a  “deviant”  in  that  it  takes 
on  non-head  status  in  the  syntax  but  head  status  in  the  canonical  form.  It 
would  make  a  great  deal  of  sense  to  encode  such  information  in  the  lexicon 
so  that  it  would  not  be  the  case  that  every  Dutch  adverbial  (e.g.,  gisteren  or 
‘yesterday’)  triggers  a  grammar  interrupt;  only  certain  adverbials  would  act 
as  triggers,  namely  those  associated  with  a  lexical  marker  (e.g.,  toevallig). 

A more troubling aspect of the “switch-rule” approach is that it is
difficult, or perhaps impossible, to accommodate head-swapping cases where
the “deviant” serves as a head in the syntax but a non-head in the canonical
form. Consider the following example:

English: Mary usually goes to school

Spanish:  Mary  suele  ir  a  la  escuela 

‘Mary  is  accustomed  to  go  to  school’ 

In this example, the canonical form could, arguably, be
‘go(Mary, school, usually).’ In this case, the English syntax parallels the
canonical form. By contrast, the Spanish syntax includes the “deviant” verb
suele, which is a head in the syntax but a non-head in the canonical form;
this is the inverse of the previous example. The head-swapping rules in
Rosetta do not include such a case, perhaps because this would force an
interrupt to occur too late, after the syntactic structure corresponding to
the logical head has already been built. Even though it might not be
linguistically justified, the canonical representation for the Spanish
sentence would have to be ‘be-accustomed(Mary, go(school)).’ By contrast,
the English sentence would map into the canonical form given above, which
means that the two would never be translation equivalents. Instead the
system would force the following, more literal, translation pairs (in both
directions):

English:  Mary  usually  goes  to  school 

Spanish:  Mary  usualmente  va  a  la  escuela 

English:  Mary  is  accustomed  to  going  to  school 

Spanish:  Mary  suele  ir  a  la  escuela 

Taking a grammar-driven approach forces the Rosetta developers to regard
such cases of mismatch as purely grammatical. Extending the notion of
compositionality into the lexicon would enhance the basic design of
Rosetta, which is fundamentally a well-developed system with coverage of a
wide range of linguistic phenomena.

5.1.7  Shake  and  Bake  MT 

One  of  the  newest  linguistic-based  translation  approaches  is  Shake  and  Bake 
MT  (S&BMT)  [21],  [22],  [35],  [221],  [222],  which  is  a  perfect  example  of  why 
we  distinguish  between  research  paradigm  and  MT  architecture.  Although 
the  originators  of  this  approach  claim  that  S&BMT  is  an  alternative  to 
the  transfer  architecture  (and  also  to  the  interlingual  architecture),  in  fact, 
transfer  rules  are  precisely  the  mechanism  through  which  the  translation 
is  achieved.  However,  while  the  mapping  between  lexical  items  is  achieved 
through  standard  transfer  rules,  the  algorithm  for  combining  these  items  to 
form a target-language sentence is unconventional.

The transfer rules are defined on the basis of “bilingual lexical entries”
which  relate  monolingual  lexical  entries.  After  the  source-language  sentence 
is  parsed,  the  source-language  words  are  mapped  to  target-language  words  by 
means  of  the  bilingual  entries.  The  algorithm  used  for  combining  the  target 
language  words  attempts  to  order  the  words  based  on  syntactic  constraints 
of  the  target  language. 

The S&BMT approach is motivated by the need to handle complex translations
such as the head-switching case given above in section 3.1. Unlike the
transfer approach, the S&BMT algorithm overcomes the difficulty of
constructing non-compositional mapping rules for such cases by selecting
target-language words from a bilingual lexicon and trying different
orderings of these words (the ‘Shake’ of S&BMT) until a sentence is produced
(the ‘Bake’ of S&BMT) that satisfies all syntactic constraints. Consider the
following case for English/Dutch, which is analogous to example (11) given
earlier:

(25) E: John enjoys swimming

D: Jan zwemt graag
‘John swims likingly’

The  statements  of  equivalence  in  the  bilingual  entries  for  the  words  used 
in  this  example  are  spelled  out  as  follows: 

(26) (i) XE & XE' = XD

<XE cite> = enjoy
<XE' cite> = prespart
<XD cite> = graag
<XE sem index> = <XD sem index>
<XE' sem index> = <XD sem index>
<XE sem exp index> = <XD sem exp index>
<XE sem obj index> = <XD sem obj index>

(ii) YE = YD

<YE cite> = swim
<YD cite> = zwemen
<YE sem index> = <YD sem index>
<YE sem agt index> = <YD sem agt index>

These bilingual rules form the basis of the transfer from the English
lexical entries XE (and XE') and YE to the Dutch lexical entries XD and YD,
respectively. The cite feature uniquely picks out an entry in the
dictionary. The sem feature associates a semantic representation with
different components of the entry. The first two usages of sem indicate that
the semantics of enjoy and the present participial (i.e., the “-ing” form in
English) are mapped to the semantics of graag. The sem feature is also used
to associate thematic relations (i.e., experiencer and object in (26i); and
agent in (26ii)). The words swim and zwemen are also related through the use
of the sem feature.

One point to note here is that the two lexical entries in English (XE and
XE') are related to one lexical entry in Dutch (XD). The relation of XE' to
XE need not be specified in the bilingual rules since this information can
be determined from grammatical constraints during translation (i.e., that
the present participial morpheme “-ing” must be associated with the verbal
argument swim, not with the main verb enjoy). There are also other types of
information inherent in the translation pairs that need not be specified in
the bilingual rules; in particular, the fact that the equivalent tense
morphemes (pres) occur on non-equivalent stems (enjoy and zwemen) follows
immediately from the mechanics of generation.

The idea behind this approach is that, once the bilingual elements
correctly identify the indices of the lexical entries, the S&BMT algorithm
has the job of “combining” these elements. Translation equivalence is stated
between bags of lexical constituents. For example, the full bags for the
source- and target-language sentences given above in (25) are the following:

(27)  {jan, pres, zwemen, graag}  =  {john, pres, enjoy, prespart, swim} 

The relations between the constituents in those bags need not be explicitly
stated in the lexicon since they can be determined from the grammatical
restrictions of the two languages.
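
The generation step can be illustrated with a minimal sketch. It assumes a
toy target grammar that accepts a single morpheme pattern; the word classes
and the function names (satisfies_constraints, shake_and_bake) are
hypothetical stand-ins for the monolingual grammar of a real S&BMT system,
not part of any published implementation.

```python
from itertools import permutations

# Hypothetical toy lexicon standing in for the monolingual target grammar.
NOUNS = {"john"}
TENSES = {"pres"}
CONTROL_VERBS = {"enjoy"}  # verbs that take a participial complement
PLAIN_VERBS = {"swim"}


def satisfies_constraints(sequence):
    """Accept only the pattern NOUN TENSE CONTROL_VERB prespart PLAIN_VERB."""
    return (
        len(sequence) == 5
        and sequence[0] in NOUNS
        and sequence[1] in TENSES
        and sequence[2] in CONTROL_VERBS
        and sequence[3] == "prespart"
        and sequence[4] in PLAIN_VERBS
    )


def shake_and_bake(bag, accepts=satisfies_constraints):
    """Return the first ordering of the bag that the grammar accepts, or None."""
    for candidate in permutations(sorted(bag)):  # the 'Shake': try each ordering
        if accepts(candidate):                   # the 'Bake': constraints hold
            return list(candidate)
    return None


# The target-language bag from (27), given in arbitrary order.
target_bag = ["swim", "prespart", "john", "enjoy", "pres"]
result = shake_and_bake(target_bag)
```

In this sketch the number of candidate orderings grows factorially with the
size of the bag, so exhaustive search is only workable for very small bags.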

The benefit of this design is that the bilingual lexicographer need only
specify contrastive knowledge between the two languages; the monolingual
grammars used for parsing and generation take care of the rest. The creators
of this design have proposed that the bilingual mappings are restricted
enough to allow for the possibility of automated acquisition of bilingual
correspondences from aligned corpora (see [22], [222]).

A disadvantage of this approach is that, as described in [35], S&BMT
generation is an NP-complete problem. Thus, there is no known tractable
general algorithm for generating within the S&BMT framework. However, it is
possible to impose restrictions on the target-language bag which forms the
input to generation. For example, heuristic control might be provided from
the structure of the source language. Brew has shown that a heuristic
approach based on constraint propagation provides considerable improvements
in practice. In addition, refinements have been proposed in [77, p. 365] for
handling translation ambiguity.

5.2  Non-Linguistic-Based  Paradigms 

In the past few years, researchers have investigated MT paradigms that are
not based on linguistic theories or even on linguistic properties of
language. This investigation has been made possible by the rapid advances in
computational power and the availability of machine-readable dictionaries
and monolingual and bilingual text corpora. The approaches all depend on the
existence of large text corpora, which are used either as training data or
as databases of existing translations. This section describes some of the
recent research in non-linguistic-based paradigms.

5.2.1  Statistical-Based  MT  (SBMT) 

The production of translations based on statistical prediction techniques
depends heavily on statistical analysis of bilingual parallel corpora. While
some early investigation [117] of the SBMT approach was done, the modern
efforts were initiated by IBM in 1988 in the Candide French-English Machine
Translation Project [39], [40], [38], [41]. Additional investigations of
SBMT have been reported in [49], [57], [91], [127], [137], [158], and [203].

The  SBMT  approach  was  derived  from  speech  processing  techniques.  In 
particular,  a  variant  of  Bayes  Rule  is  used  to  show  that  the  probability 
that  a  string  of  words  (T)  is  a  translation  of  a  given  string  of  words  (S)  is 
proportional  to  the  product  of  the  probability  that  a  string  of  target  words 
is  a  legal  utterance  in  the  target  language  and  the  probability  that  a  string 
of  words  in  the  source  language  is  a  translation  of  the  string  of  words  in  the 
target  language.  That  is: 

(28) P(T|S) ∝ P(T) * P(S|T)

If the right-hand probabilities are known, the translation is obtained by
choosing T such that the left-hand probability is maximized. Obviously, the
probabilities for all strings in both languages cannot be known and
consequently must be estimated for the approach to be tractable. The usual
approach is to define approximate probabilistic models constructed from
probabilities that can be directly estimated from existing data.
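
The step from Bayes' rule to the maximization can be written out as follows.
This is the standard derivation in the notation of (28); the hat on T is our
own shorthand for the translation that is finally chosen.

```latex
\begin{align*}
P(T \mid S) &= \frac{P(T)\,P(S \mid T)}{P(S)} \\
\hat{T} &= \arg\max_{T} P(T \mid S) = \arg\max_{T} P(T)\,P(S \mid T)
\end{align*}
```

Since P(S) is fixed for a given source string, it does not affect which T
maximizes the expression and can be dropped.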

The Candide language model is a trigram model asserting that the
probability that any word in a target language (English) string is part of a
legal sentence depends only on the two previous words. Knowing these
probabilities, an estimate of the probability that a string of words is a
legal sentence is the product of all the trigram probabilities in the
string. The trigram probabilities can be estimated by counting the frequency
of word triples in a large corpus of English text.6
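
A trigram model of this kind can be sketched in a few lines. The sketch
assumes tokenized sentences and uses add-one smoothing purely for
illustration; the function names and the padding symbols are our own, and
nothing here is claimed to match the smoothing actually used in Candide.

```python
from collections import Counter


def train_trigram_model(sentences):
    """Count trigrams and their two-word histories in tokenized sentences."""
    trigram_counts, history_counts, vocab = Counter(), Counter(), {"</s>"}
    for words in sentences:
        words = list(words)
        vocab.update(words)
        padded = ["<s>", "<s>"] + words + ["</s>"]
        for i in range(2, len(padded)):
            history = (padded[i - 2], padded[i - 1])
            trigram_counts[history + (padded[i],)] += 1
            history_counts[history] += 1
    return trigram_counts, history_counts, vocab


def trigram_prob(model, w1, w2, w3):
    """P(w3 | w1, w2) with add-one smoothing (an illustrative assumption)."""
    trigram_counts, history_counts, vocab = model
    return (trigram_counts[(w1, w2, w3)] + 1) / (history_counts[(w1, w2)] + len(vocab))


def sentence_prob(model, words):
    """Product of the trigram probabilities over the padded sentence."""
    padded = ["<s>", "<s>"] + list(words) + ["</s>"]
    prob = 1.0
    for i in range(2, len(padded)):
        prob *= trigram_prob(model, padded[i - 2], padded[i - 1], padded[i])
    return prob


model = train_trigram_model([["john", "swims"], ["john", "sleeps"]])
```

With this toy corpus, a word order seen in training ("john swims") receives
a higher probability than the unseen reversed order.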

The probabilistic model underlying the Candide system assumes that the
probability that a source language (French) word is a translation of a given
English word depends only on the English word. A single translation allows
for 0 to 10 French words. It is also assumed that the English equivalents of
the French words might be ordered differently in the target sentence. The
estimation of these probabilities is considerably more difficult than for
the trigrams. In this case a very large bilingual parallel corpus is
required.7 The problem is to align each English word in a target sentence
with the French equivalent(s) in the source sentence.8 The approach is to
assume values for the alignment probabilities and compute the transfer
probabilities of the sentence pairs in the corpus. Depending on the
alignment and translation probabilities, a given sentence pair may have
several transfer probabilities. Each occurrence of the alignments and
translations is counted and weighted by the transfer probability. These
weighted counts are used to make a new estimate of the individual
probabilities and the process is repeated. This iterative approach converges
to a local equilibrium of probabilities and is used to compute P(S|T).9
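
The iterative re-estimation described above can be sketched with a much
simpler model than the one used in Candide. The sketch below assumes a
word-for-word translation model in which every alignment is equally likely a
priori (in the style of what is usually called IBM Model 1); it ignores the
0-to-10 word fertility and the alignment probabilities mentioned in the
text, and the function name train_model1 is our own.

```python
from collections import defaultdict


def train_model1(sentence_pairs, iterations=10):
    """Estimate t(f | e) from (french_words, english_words) pairs by iteration."""
    french_vocab = {f for french, _ in sentence_pairs for f in french}
    # Start from uniform translation probabilities.
    t = defaultdict(lambda: 1.0 / len(french_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # weighted counts of (f, e) co-occurrences
        total = defaultdict(float)  # weighted counts per English word
        for french, english in sentence_pairs:
            for f in french:
                # Share one unit of count for f among the English words,
                # in proportion to the current translation probabilities.
                norm = sum(t[(f, e)] for e in english)
                for e in english:
                    weight = t[(f, e)] / norm
                    count[(f, e)] += weight
                    total[e] += weight
        # Re-estimate the probabilities from the weighted counts.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t


pairs = [
    (["la", "maison"], ["the", "house"]),
    (["la", "fleur"], ["the", "flower"]),
]
t = train_model1(pairs, iterations=10)
```

On this two-sentence corpus the estimates move toward pairing la with the
and maison with house, even though no alignment was supplied.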

6 Of  course,  no  corpus  will  be  large  enough  to  contain  all  possible  triples  and  some 
smoothing  method  is  required  to  assign  a  (small)  probability  to  unseen  triples. 

7 Fortunately, French and English versions of the Canadian parliamentary
proceedings (called the Hansards) are available.

8 The  search  for  source  text  words  is  non-trivial.  See  [38]. 

9 The  details  of  this  process  are  outside  the  scope  of  this  paper.  See  [50].  Obviously 
This approach has been tested in the laboratory for French-English MT and
produces translations approaching the quality of those from syntactic
transfer systems. The grand claim made for this approach is that no lexicons
or grammars are used. Everything comes from statistical analysis of corpora.
While this is the strength of this approach, it is also the weakness since
the corpora must exist. Also, the translations are very dependent on the
domain of the corpora. An even more serious problem is that the only way to
improve the quality of the translation is to improve the accuracy of the
probabilistic models of the target language and of the translation process.
Unfortunately, this would add many more parameters to the millions required
by the simple models described above. Recognizing this difficulty, IBM has
applied a number of techniques from classical computational linguistics to
form a hybrid system. Morphological analysis, part-of-speech tagging,
syntactic regularization, limited grammatical analysis, and contextual
marking are used. Most of these techniques are parameterized with values
derived from analysis of the corpora. Some improvement was achieved;
however, they were unable to match the best of the commercial MT systems
[224].

IBM  continued  to  work  on  this  system  until  1995  when  both  internal  and 
external  support  were  withdrawn.  Two  quotes  from  Yorick  Wilks  [224]  best 
summarize  the  impact  of  this  work: 

“Brown  et  al.’s  retreat  to  incorporating  symbolic  structures  shows 
the  pure  statistics  hypothesis  has  failed.” 

“Another  way  of  looking  at  this  is  how  much  good  IBM  is  doing 
us  all:  by  showing  us,  among  other  things,  that  we  have  not  spent 
enough  time  thinking  about  how  to  acquire,  in  as  automatic  a 
manner  as  possible,  the  lexicons  and  rule  bases  we  use.” 

Another interesting approach that uses statistical techniques to generate
MT systems is LINGSTAT, developed by Dragon Systems, Inc. ([230], [231],
and [19]). This work started in 1992 as a translation aid: a direct
substitution system with a simple, hand-generated finite state grammar for
Japanese. The English glosses were based on bilingual dictionaries and the
grammar was used to assign Japanese phrase attachment. This was quickly seen
to be unsatisfactory and a number of statistical steps were taken to provide
improved complete translation. The finite state grammar was expanded to a
probabilistic context free grammar (PCFG) and trained on Japanese text. The
PCFG was used to provide a gross parse and a lexicalized grammar (also
trained on Japanese text) was used to assist with attachments.
Hand-generated reordering rules were provided to assist the transfer to
English. Further, a trigram probabilistic language model was developed for
English to assist in gloss selection. Some improvement was obtained with
these changes. These techniques were ported to Spanish/English translations
with somewhat better results than for the Japanese.


smoothing  is  required. 
The LINGSTAT project ended in 1995 when support was withdrawn.
Interestingly, linguistic-based extensions were planned as a continuation of
this work. One extension involved the assignment of case frame categories to
the source-language verbs in order to improve the parse. The probabilities
of these sub-categories were to be learned by iterative parsing of source
text. Additional extensions involved experimentation with extraction of
phrase translations from parallel bilingual corpora. While the LINGSTAT
group never committed to “pure” statistical MT as did the Candide group,
they were strongly committed to statistical training and extraction of more
symbolic approaches. This intersection of statistical and symbolic paradigms
is relevant to the hybrid techniques discussed in Section 5.3 below.

5.2.2  Example-  (or  Case-/Memory-)  Based  MT  (EBMT) 

Example-Based MT (EBMT), first suggested by Nagao [146], emulates human
translation practice in recognizing the similarity of a new source language
sentence or phrase to a previously translated item and using this previous
translation to perform “Translation by Analogy”. Sato and Nagao [181]
implemented an experimental EBMT system to demonstrate the translation of
simple Japanese sentences into English. Additional investigations of EBMT
have been reported in [85], [91], [110], [138], [141], [154], [158], [162],
[173], [199], [204], [205], [232].

The basic idea of EBMT assumes a data base of parallel translations which
is searched for the source language sentences and phrases that most closely
match a new source language sentence. The translations of the matched
phrases are then modified and combined to form a transfer translation of the
new sentence. This technique is quite similar to Case Based Reasoning used
in Artificial Intelligence (see [126]). A simple match would be an identical
phrase (especially in the function words) except for a similar content word.
The closeness of the match would be determined by the semantic “distance”
between the two content words as measured by some metric based on a
thesaurus or ontology. The translation would then be obtained by
substituting a translation of the differing word into the translation of the
best match.
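
The simple match-and-substitute case can be sketched as follows. Everything
in the sketch is an illustrative assumption: the tiny English/Spanish
example base, the bilingual dictionary, the thesaurus paths, and the
path-based distance are invented for this illustration and are not taken
from any of the systems cited above.

```python
# Hypothetical example base of (source words, target words) pairs.
EXAMPLES = [
    (["he", "eats", "the", "apple"], ["él", "come", "la", "manzana"]),
    (["he", "drives", "the", "car"], ["él", "conduce", "el", "coche"]),
]
# Hypothetical bilingual dictionary for content words.
DICTIONARY = {"apple": "manzana", "pear": "pera", "car": "coche", "truck": "camión"}
# Hypothetical thesaurus: each word is filed under a path of categories.
THESAURUS = {
    "apple": ("object", "food", "fruit"),
    "pear": ("object", "food", "fruit"),
    "car": ("object", "vehicle", "motor"),
    "truck": ("object", "vehicle", "motor"),
}


def distance(a, b):
    """Path distance in the thesaurus tree; infinite for unknown words."""
    if a == b:
        return 0
    if a not in THESAURUS or b not in THESAURUS:
        return float("inf")
    path_a, path_b = THESAURUS[a] + (a,), THESAURUS[b] + (b,)
    common = 0
    for x, y in zip(path_a, path_b):
        if x != y:
            break
        common += 1
    return len(path_a) + len(path_b) - 2 * common


def translate(sentence):
    """Translate by analogy with the closest example differing in one word."""
    best_cost, best_target = float("inf"), None
    for source, target in EXAMPLES:
        if len(source) != len(sentence):
            continue
        diffs = [i for i, (s, w) in enumerate(zip(source, sentence)) if s != w]
        if not diffs:
            return list(target)  # exact match: reuse the stored translation
        if len(diffs) > 1:
            continue             # only single-word differences are handled
        old, new = source[diffs[0]], sentence[diffs[0]]
        cost = distance(old, new)
        if old in DICTIONARY and new in DICTIONARY and cost < best_cost:
            # Substitute the translation of the differing word.
            best_cost = cost
            best_target = [DICTIONARY[new] if w == DICTIONARY[old] else w for w in target]
    return best_target


pear_sentence = translate(["he", "eats", "the", "pear"])
```

A realistic system would also need to adjust agreement (for example, the
article) after substitution, which this sketch does not attempt.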

The accuracy and quality of the translation depend heavily on the size and
coverage of the parallel data base. While the data base need not be as large
as required for SBMT (since the full vocabulary need not be covered), the
required coverage of syntactic and semantic divergences results in a size
that is difficult to store and search. Phrasal matching requires at least a
rough syntactic analysis of the parallel translations as well as some
semantic analysis to determine the closeness of the match.10 In order to
avoid matching improper divergences, Collins and Cunningham [52] weight
phrasal translations by their frequency in the database.

Sentence translation in EBMT requires that, in addition to phrasal
matching, the syntactic structure of the source sentence must be matched
with

10Nirenburg,  Domashnev,  and  Grannes  [152]  argue  that  such  analysis  defeats  the  purpose 
of  EBMT  and  propose  substring  pattern  matching  using  synonyms  and  hyperonyms. 
sentences in the database. While full sentence matching has shown some
success [86], most uses of EBMT are restricted to subproblems such as
function words [205], noun phrases [180], and prepositional phrase
attachment [205].

5.2.3  Neural  Network  Based  MT  (NBMT) 

Experiments have been done with neural network technology for such MT
functions as parsing [107], lexical disambiguation [95], and learning of
grammar rules. The incorporation of neural networks and connectionist
approaches into MT systems is a relatively new area of investigation [103].
Most recently, Castano et al. [48] have run some tests with very small
vocabularies (about 30 words) and simple syntax. Handling large vocabularies
and grammars dramatically inflates the size of the neural networks, the
training set, and the training time. In addition, dealing with word
sequences requires an explicit representation of time, further complicating
the neural network representation. McLean [141] uses neural nets to find
similar sentences in an EBMT system. But again, a small vocabulary (30
words) and short sentences are used. It is not clear that this approach can
be extended to existing EBMT systems. In contrast with the other approaches
described in this paper, no realistic MT systems have been built based
solely on neural network technology. This technology is thus more of a
technique than a system approach.

5.3  Hybrid  Paradigms 

In the previous section on non-linguistic-based paradigms it was mentioned
that many of those paradigms had difficulty with some aspects of the MT
process. For example, SBMT does not handle long range contextual
dependencies and EBMT has difficulties with complex sentence structure. It
was quickly recognized that these non-linguistic paradigms could be combined
with linguistic paradigms to exploit the strengths of each [44], [91], [85],
[130], [154]. The hybrid paradigm involves a mixing of MT paradigms (as well
as mixing of the MT architectures). The usual approach is to use linguistic
methods to obtain parses of the source text and to use statistical or
example techniques to resolve dependencies and phrasal translations [85].
Statistical trigram target language models have been used for lexical
selection [42]. Statistically generated decision trees have been used to
insert English articles into article-free translations of Japanese text
[124]. The Pangloss system [156] is a hybrid of both MT paradigms and MT
architectures.

6  Evaluation  of  MT  Systems 

The  evaluation  of  MT  systems  is  also  an  active  area  of  research.  Once  an  MT 
system  or  portion  of  an  MT  system  is  built,  how  does  one  evaluate  whether 
it  is  working  correctly  and  whether  it  is  a  promising  approach  with  which  to 
continue?  As  noted  in  Hutchins  [101],  it  is  clear  that  fully  automatic  high 
quality translation is no longer the current goal of most MT experts. In fact,
it  is  expected  that  revision  is  required  for  all  translations,  whether  done  by 
humans  or  computers.  Thus,  in  order  to  decide  what  the  evaluation  criteria 
for  a  machine  translation  system  should  be,  we  must  first  determine  what 
the  intended  use  of  the  MT  output  will  be,  and  then  decide  whether  the 
output  is  satisfactory  for  this  purpose.  Hutchins  argues  that  “There  can  be 
valid  uses  of  poor  quality  output  in  unedited  form  if  it  is  produced  quickly, 
cheaply,  and  is  not  intended  for  publication.  If  better  quality  is  required 
then collaboration of man and machine is essential.”

Given  that  “perfect”  translation  is  not  within  our  grasp  now,  if  ever,  we 
still  need  to  decide  how  we  can  judge  whether  the  output  is  high  quality  or 
low  quality.  Hutchins  claims  that  the  concept  of  good  quality  MT  output 
is  an  elusive  concept.  As  observed  by  Van  Slype  [214],  it  is  difficult  to  find 
an objective measure of any type of translation, machine-aided or otherwise.
(In  fact,  there  is  no  quality  control  metric  for  human  translators.) 

In this section, we will first briefly discuss why the evaluation of
translations is so elusive and then describe current solutions to evaluation
of translations and MT systems. We do this by first outlining the various
approaches that can be taken for defining evaluation criteria and then the
techniques that can be applied within these approaches.

6.1  Evaluation  Challenges 

NL  applications,  such  as  MT,  have  some  unique  problems  that  must  be 
accounted  for  when  doing  evaluations.  The  biggest  problem  with  evaluating 
NL  applications  is  minimizing  the  subjectivity  that,  to  date,  has  proven 
unavoidable  due  to  the  nature  of  natural  language  itself.  Standard  software 
evaluation  techniques  must  be  enhanced  to  allow  for  the  multiple  “correct” 
answers  that  frequently  occur  with  natural  language.  It  is  not  clear  what 
constitutes  a  correct  answer  especially  when  dealing  with  translations.  It  is 
because  of  this  that  judging  the  correctness  of  the  output  for  MT  still  retains 
a  degree  of  subjectivity. 

As pointed out in [14], there are no neighboring disciplines to which we
can look for criteria and techniques. There is no general, well-developed
methodology for evaluating software systems but, as we will see in the next
section, there are some evaluation criteria that generally apply to software
systems. Besides the lack of a general evaluation methodology, there are no
clear measures for human translations to guide us and, for that matter, it
is questionable whether MT systems should even be attempting to simulate the
behavior of human translators. According to Krauwer [128], the human
translator metaphor is questionable because, while the output quality may
improve for a short time, it most likely will hit a point of little or no
improvement given our current technology. He further suggests that it is
better for designers and users to negotiate the specifications for
specialized systems. The evaluation can then be based on the specifications.
Admittedly, it is still not an easy task to come up with the specifications
but it would enable better evaluations.

In keeping with the idea of writing specifications for MT systems, we must
keep in mind that we need to produce an output that suffices for the
intended use (most desirably this would be according to some specification),
and we must do this cost-effectively.

6.2  Evaluation  Approaches 

The  approach  one  takes  when  evaluating  software  systems  (in  general)  is 
two-fold: (1) evaluation of the accuracy of the input/output pairs; and (2)
evaluation  of  the  architecture  of  the  system  and  the  data  flow  between  the 
system  components.  The  former  (external)  view  of  software  evaluation  is 
called  “black-box”  evaluation,  and  the  latter  (internal)  view  is  referred  to  as 
“glass-box”  evaluation  [167].  Black-box  evaluation  covers  engineering  issues 
such  as  reliability,  productivity,  user  learnability  and  user  friendliness.  Glass- 
box  evaluation  also  considers  reliability  (at  the  component-level)  as  well  as 
maintainability,  improvability,  extendibility,  compatibility  and  portability. 

Black-box  evaluation,  in  the  case  of  MT,  tends  to  focus  on  evaluating  the 
translation-quality  of  the  output.  Essentially  it  is  an  attempt  to  measure 
the  acceptability  of  the  translation  to  users.  To  produce  the  most  objective 
measure  possible,  a  standard  test-suite  of  input/output  pairs  should  be
established  for  judging  whether  the  system  is  performing  “correctly”  or  not
and  whether  it  will  be  cost  effective.  In  light  of  the  above  discussion,  this 
is  a  very  costly  undertaking  and  has  yet  to  be  satisfactorily  accomplished  in 
any  evaluation  of  an  MT  system. 

Another  difficulty  in  applying  a  black-box  evaluation  approach  is  the 
number  of  dimensions  along  which  MT  developers  must  limit  their  systems. 
These  systems  can  be  thought  of  as  shells  that  are  customized  to  apply  to  a 
particular  domain,  language  pair,  and  type  of  text.  The  evaluation  criteria 
(i.e.,  how  well  the  system  translates  these  texts)  must  also  be  limited  along  the
same  dimensions,  but  there  is  no  common  range  among  the  systems.  Because 
of  this  lack  of  commonality,  some  systems  will  need  to  be  customized  for 
the  chosen  ranges  in  order  to  do  comparative  evaluations.  Comparative 
evaluations  would  be  the  goal  for  users  looking  to  purchase  an  MT  system. 
Researchers  are  also  interested  in  comparative  evaluations  to  determine  the 
effectiveness  of  their  MT  paradigm  or  micro-theory.  However,  the  most 
useful  information  in  this  case  tends  to  result  from  glass-box  approaches  to 
evaluation. 

The  glass-box  approach  attempts  to  evaluate  the  system’s  internal
processing  strategies  to  measure  how  well  the  system  does  something.  According
to  the  ideas  for  evaluating  NLP  systems  [167],  this  type  of  evaluation  should
include  a  determination  of  the  system’s  linguistic  coverage,  and  an
examination  of  the  linguistic  theories  used  to  handle  the  linguistic  phenomena.
Determining  the  linguistic  coverage  means  testing  what  linguistic  phenomena 
are  handled  and  to  what  degree.  The  examination  of  the  linguistic  theories 
used  includes  how  closely  these  theories  were  followed  in  the  implementation 
and  noting  what  modifications  had  to  be  made  to  the  theories.  In  addition, 
the  performance  of  the  system’s  various  modules  must  be  examined,  and  the
evaluation  of  each  of  these  modules  should  be  treated  as  an  individual
black-box  evaluation.  Under  the  glass-box  evaluation  approach,  techniques  for
measuring  improvability  have  received  the  most  attention. 

Considering  these  basic  evaluation  approaches,  what  then  are  reasonable
and  useful  evaluation  criteria  for  MT  systems?  There  are  a  number
of  dimensions  along  which  one  can  make  a  judgement  of  the  quality  of  MT
output.  Carbonell  et  al.  [46]  enumerate  the  following  external  evaluation
criteria: 

1.  Semantic  invariance:  Is  the  “meaning”  of  the  source  text  preserved  in
the  target  text? 

2.  Pragmatic  invariance:  Is  the  implicit  intent  or  illocutionary  force  (e.g.,
politeness,  urgency,  etc.)  of  the  source  text  preserved  in  the  target 
text? 

3.  Structural  invariance:  Is  the  syntactic  structure  of  the  source  text 
preserved  in  the  target  text? 

4.  Lexical  invariance:  Is  there  a  one-to-one  mapping  of  words  or  phrases
from  source  to  target  texts? 

5.  Spatial  invariance:  Are  the  external  characteristics  of  the  source  text,
such  as  length,  location  on  page,  etc.,  preserved  in  the  target  text?

Semantic  invariance  is  today  a  more  dominant  criterion  (in  contrast  to 
the  early  days  when  MT  systems  primarily  sought  to  preserve  lexical
invariance).  In  general,  MT  systems  currently  seek  to  preserve  meaning  and
style. 

Other  researchers  argue  that,  in  order  to  determine  which  criteria  are 
important  in  evaluating  an  MT  system,  we  must  first  know  what  type  of
text  we  are  translating.  In  [147],  the  criteria  for  evaluation  are  determined 
on  the  basis  of  a  classification  of  the  different  types  of  text  that  are  to 
undergo  transformation  into  a  foreign  language.  For  example,  if  we  are 
translating  poetry,  we  would  want  to  preserve  pragmatic  invariance,  whereas 
if  we  are  translating  technical  and  scientific  material,  we  would  want  to 
preserve  semantic,  lexical,  and  possibly  spatial  invariance. 

If  translation  is  to  be  confined  to  technical  and  scientific  matter,  then  the 
text  is  generally  from  very  narrowly  defined  fields  that  restrict  the  lexicon 
and  grammar  and  constitute  a  sublanguage.  In  this  case  full  “understanding” 
is  less  likely  to  be  a  necessity  since  the  set  of  constructs  is  bounded  and  the 
vocabulary  is  limited;  thus,  a  small  set  of  simple  mappings  may  be  used. 

On  the  other  hand,  translating  free-text  is  a  much  harder  problem  than 
that  of  translating  texts  that  are  restricted  to  a  particular  sublanguage.  In 
order  to  make  an  evaluation  of  a  system  that  is  intended  to  translate  free- 
text,  we  need  to  look  at  the  degree  to  which  a  machine  translator  might 
make  mistakes  if  we  are  lenient  with  our  “understanding”  requirement.  We 
can  then  decide  if  it  is  possible  to  get  around  these  mistakes  without  adding 
a  high  degree  of  “understanding.” 

Van  Slype  [214]  (quoted  from  Hutchins)  offers  additional  evaluation
criteria  for  black-box  approaches  to  evaluation  and  has  identified  a  number  of
metrics  for  evaluating  the  degree  of  success  of  an  MT  system:

1.  Intelligibility  of  output  text,  e.g.,  via  readability  scales. 

2.  Fidelity  to  the  SL  original,  e.g.,  via  measures  of  information  transfer. 

3.  Acceptability  to  recipient  of  translation. 

4.  Time  spent  in  revision  (post-editing). 

5.  Number  of  errors  corrected,  and  type. 

A  paper  by  Slocum  and  Justus  [197]  addresses  some  of  the  engineering 
measures  described  under  the  black-box  evaluation  approach  as  well  as  some 
of  the  measures  described  under  the  glass-box  evaluation  approach  in
addition  to  the  usual  focus  on  improvability.  The  criteria  derived  from  this  paper
are: 

1.  Cross-linguistic  applicability:  the  MT  system  must  support  several 
human  languages.  This  means  the  system  must  be  easily  extensible. 
In  particular,  adding  coverage  for  a  new  language  should  be  facilitated. 

2.  Performance:  the  MT  system  must  support  implementation  on  a
parallel  architecture,  or  perform  decently  on  non-parallel  machines.

3.  Eased  acquisition:  the  MT  system  must  be  built  on  top  of
syntactic,  semantic,  and  lexical  information  sources  that  are  easily  updated,
perhaps  automatically. 

4.  Uniform  analysis  and  synthesis:  the  MT  system  should  have  rules  that 
are  used  during  both  types  of  processing. 

5.  Fault-tolerant,  fail-soft:  the  MT  system  should  have  adequate  error
recovery  and  it  should  be  able  to  provide  an  understandable  explanation
for  failures  (e.g.,  misspelled  word,  unanalyzable  syntax,  etc.). 

6.  Suitability  for  Speech  Input/Output:  the  MT  system  should  provide 
support  for  speech  processing  (e.g.,  it  should  provide  for  the  possibility 
that  word  boundaries  are  often  ignored  in  speech). 

Additional  evaluation  criteria  provided  in  a  paper  by  King  [118]  that  also 
fall  under  the  black-box  and  glass-box  approaches  to  evaluation  are: 

1.  Practicality:  the  MT  system  must  have  fall-back  mechanisms.  The
interface  structure  must  include  information  on  the  valency  boundedness
of  constituents,  on  their  surface  syntactic  function,  etc.  so  that  when 
no  semantic  interpretation  is  available,  the  system  can  provide  some 
translation  rather  than  none  at  all.  (In  the  worst  case,  the  translation 
would  be  word-for-word.) 


2.  Collaboration:  the  MT  system  should  be  built  by  means  of  joint  teams
that  define  and  construct  the  sharable  components  (e.g.,  the
interlingua  or  the  transfer  rule  language).  There  must  be  an  agreement  to
use  common  basic  software  that  manipulates  an  agreed-upon  data
structure.

3.  Extensibility:  the  MT  system  must  provide  the  ability  to  add  new 
language  pairs  at  any  time  without  having  to  re-write  the  pre-existing 
system. 

Summarizing  all  the  criteria  given  above  into  a  final  list  is  difficult  since 
the  criteria  need  to  be  further  adapted  to  the  particular  type  of  text  that  is 
being  translated.  This  comment  notwithstanding,  we  consider  the  following 
criteria  to  be  crucial  in  the  evaluation  of  MT  systems: 

1.  Intelligibility  -  must  be  readable  and  reasonably  “natural.” 

2.  Fidelity  -  must  preserve  certain  characteristics  of  the  source  text  (e.g., 
must  support  structural  invariance). 

3.  Acceptability  -  must  be  satisfactory  for  intended  purpose  (e.g.,  must 
conform  to  properties  of  relevant  sublanguage). 

4.  Speed  -  must  have  reasonable  run-time. 

5.  Cost  -  must  be  cost-efficient. 

6.  Time  spent  for  revision  -  must  require  as  little  post-editing  as  possible. 

7.  Number  of  errors  -  must  not  have  an  unreasonably  large  number  of 
errors  (e.g.,  every  other  sentence  on  the  average). 

8.  Cross-linguistic  applicability  -  must  support  several  languages  in  a 
uniform  fashion. 

9.  Extensibility  -  must  provide  ability  to  easily  add  new  languages. 

10.  Uniform  analysis  and  synthesis  -  must  use  same  data  structures  for 
both  parsing  and  generation. 

11.  Fault-Tolerance  -  must  handle  errors  gracefully,  and  must  provide  some 
translation  rather  than  none  at  all. 

12.  Collaboration  -  all  languages  must  operate  on  the  basis  of  common
software  and  data  structures.

Some  of  these  evaluation  criteria  require  further  definition  depending  on 
the  intended  purpose  of  the  MT  system.  For  example,  what  is  “natural”  in 
one  domain  may  not  be  “natural”  in  another  domain.  In  addition,  various 
measures  must  be  specified:  “reasonable  run-time”  might  be  different  for 
on-line  processing  vs.  off-line  processing;  “cost-efficient”  might  mean  one 
thing  to  one  end  user  and  something  else  to  another;  and  “unreasonably 
large  number  of  errors”  might  mean  every  other  word  in  one  domain  and
every  other  sentence  in  another  domain.  Also,  the  “graceful”  handling  of
errors  depends  on  what  purpose  the  system  serves  (e.g.,  whether  the  system 
is  intended  to  operate  interactively  as  in  a  tutorial  situation,  or  whether  the 
system  is  intended  to  operate  as  a  batch  job). 

In  conjunction  with  the  intended  use  of  an  MT  system  is  the  notion  of 
the  different  purposes  behind  doing  an  evaluation.  An  end-user  of  an  MT 
system  will  approach  evaluation  differently  from  a  developer  or  a  researcher. 
Not  only  will  the  most  appropriate  techniques  for  these  different  types  of 
evaluators  differ  but  so  will  the  goals  they  have  for  doing  an  evaluation. 
Researchers  typically  work  with  test-suites  since  usually  they  are  focusing 
on  one  aspect  of  the  whole  problem  of  translation  at  one  particular  time  (e.g. 
a  theory  about  translation,  an  architecture,  or  a  technique  for  handling  a 
difficult  phenomenon  within  a  particular  theoretical  framework).  A  developer
will  use  test-suites  to  ensure  that  modifications  have  not  affected  sentences  that
were  previously  correctly  translated  (regression  testing)  and  to  check  whether
the  targeted  sentences  that  motivated  the  modification  are  now  correctly
handled.  An  end-user  will  use  test-suites  to  comparatively
evaluate  MT  systems  when  considering  a  purchase  and  will  also  use  test- 
suites  after  acquiring  a  system  and  arranging  for  system  extension  in  lexical 
or  grammatical  coverage. 
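
The developer's use of a test-suite described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name compare_runs, the representation of a run as a mapping from test sentence to a correct/incorrect judgement, and the sample data are all hypothetical and are not taken from any of the systems surveyed.

```python
def compare_runs(before, after, targeted):
    """Compare two test-suite runs made before and after a modification.

    before, after: dict mapping each test sentence to True if its
    translation was judged correct in that run, else False.
    targeted: set of sentences that motivated the modification.
    """
    # Regression: correct before the change, no longer correct after it.
    regressions = sorted(s for s, ok in before.items()
                         if ok and not after.get(s, False))
    # Targeted sentences that the change actually repaired.
    fixed = sorted(s for s in targeted
                   if not before.get(s, False) and after.get(s, False))
    # Targeted sentences that are still wrong after the change.
    unresolved = sorted(s for s in targeted if not after.get(s, False))
    return {"regressions": regressions, "fixed": fixed,
            "unresolved": unresolved}


# Hypothetical judgements for three test sentences (labelled s1..s3).
before = {"s1": True, "s2": False, "s3": True}
after = {"s1": True, "s2": True, "s3": False}
result = compare_runs(before, after, targeted={"s2"})
```

In this invented run the modification repairs the targeted sentence s2 but breaks s3, which is exactly the situation regression testing is meant to expose.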

A  final,  frequently  overlooked  point  is  that  the  MT  paradigm  has  a
significant  impact  on  the  choice  of  evaluation  criteria  [14].  Today’s
statistical-based  and  example-based  paradigms  should  not  be  expected  to  rate  as  well
on  fidelity,  intelligibility  and  acceptability,  for  example,  as  the  linguistics- 
based  paradigms.  On  the  other  hand,  we  would  expect  the  linguistics-based 
paradigms  to  be  less  fault-tolerant.  This  idea  meshes  well  with  the  intended
use  of  an  MT  system.  A  statistical-based  approach  would  be  expected  to 
provide  rough  translations  more  cost-effectively. 

6.3  Evaluation  Techniques 

Test-suites  are  often  proposed  as  a  way  to  determine  a  system’s  linguistic 
coverage  and  can  be  useful  for  both  black-box  and  glass-box  approaches  to 
evaluation.  When  one  is  more  interested  in  the  types  of  errors  produced 
by  a  system  than  the  total  number  of  errors,  test-suites  are  most  often  the 
technique  used. 

To  construct  a  test-suite  one  must  attempt  to  predict  the  linguistic
constructions  and  legal  combinations  of  these  constructions  that  will  be
encountered  in  the  input.  In  addition,  it  is  important  to  include  illegal  constructions,
since  an  inability  to  recognize  a  construction  as  illegal  can  also  result  in
poor  quality  output.  So  a  test  suite  could  contain  sentences  with
different  verb  forms  and  auxiliaries  or  various  complex  sentence  structures  such
as  sentences  with  restrictive  or  non-restrictive  relative  clauses,  or  conjoined 
clauses. 
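
As a rough illustration of how such combinations might be enumerated, the following Python sketch crosses a few hand-picked alternatives for each slot of a sentence. The word lists and the function name generate_suite are invented for this example; a real test-suite would be designed by linguists and would also contain illegal constructions.

```python
from itertools import product

# Hypothetical alternatives for each slot; each list varies one construction.
SUBJECTS = [
    "John",                                # plain noun phrase
    "The man who lives next door",         # restrictive relative clause
]
VERB_GROUPS = ["went", "has gone", "will go", "might have gone"]
COMPLEMENTS = [
    "into the house",                      # single clause
    "into the house and Mary stayed outside",   # conjoined clauses
]


def generate_suite():
    """Yield one test sentence per combination of the slot alternatives."""
    for subject, verbs, complement in product(SUBJECTS, VERB_GROUPS,
                                              COMPLEMENTS):
        yield f"{subject} {verbs} {complement}."


suite = list(generate_suite())
# Size is the product of the slot sizes: 2 * 4 * 2 = 16 sentences.
```

Adding a single further slot with three alternatives would already triple the suite to 48 sentences, since the size is the product of the number of alternatives in every slot.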

However,  determining  the  appropriate  constructions  to  include  in  a  test- 
suite  is  difficult  and  the  size  of  the  test-suite  grows  quickly.  To  bound  the 
problem,  the  test-suite  developers  must  know  what  linguistic  phenomena  are
of  greatest  importance  to  the  users  and  be  well-versed  in  linguistics  and  the
languages  of  interest  [120]. 

Test  suites  have  also  been  proposed  by  [120]  as  a  way  to  test  the
improvability  of  an  MT  system.  Improvability  tests  assume  that  either  the  evaluator
is  working  closely  with  the  developer  or  that  the  evaluator  is  able  to  modify
the  system.  The  caveats  mentioned  earlier  on  bounding  the  problem  apply
here  as  well.

The  simplest  use  of  a  test  suite  is  to  run  the  system  on  it  and  record 
the  successes  and  failures.  This  then  gives  developers  and  perhaps  potential 
end-users  an  idea  of  what  constructions  are  problematic  as  well  as  an  idea  of  the
overall  progress  being  made  in  the  development  of  the  MT  system.  However, 
unless  a  clear  record  is  made  of  what  constructions  and  interactions  the  input 
is  intended  to  test,  one  can  only  get  an  indication  of  the  overall  progress  in 
development.  The  developer  will  then  have  to  spend  time  examining  each 
failure  and  determining  exactly  what  went  wrong  in  the  system. 
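
The value of recording what each input is intended to test can be shown with a minimal Python sketch. Everything here is an assumption made for illustration: the TestItem record, the run_suite function, the exact-match comparison against a single reference translation (a strong simplification), and the lookup table standing in for an MT system.

```python
from collections import defaultdict
from typing import NamedTuple


class TestItem(NamedTuple):
    source: str        # input sentence
    expected: str      # one acceptable translation (a simplification)
    phenomena: tuple   # constructions this item is intended to test


def run_suite(translate, suite):
    """Return {phenomenon: [passed, total]} for one run of the suite."""
    tally = defaultdict(lambda: [0, 0])
    for item in suite:
        ok = translate(item.source).strip() == item.expected
        for phenomenon in item.phenomena:
            tally[phenomenon][1] += 1
            if ok:
                tally[phenomenon][0] += 1
    return dict(tally)


# Toy stand-in for an MT system: a lookup table, purely for illustration.
TOY_MT = {
    "John went into the house": "John entró en la casa",
    "John entered the house": "John entró la casa",   # drops the preposition
}

suite = [
    TestItem("John went into the house", "John entró en la casa",
             ("past tense", "locative PP")),
    TestItem("John entered the house", "John entró en la casa",
             ("past tense", "structural divergence")),
]

report = run_suite(lambda s: TOY_MT.get(s, ""), suite)
```

Because each item names the phenomena it tests, the report points at the structural divergence as the failing construction; without that record, the same run would only show that one of two sentences failed.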

Some  problems  with  test  suite  construction  as  noted  in  Arnold  et  al.  [14]  are:

1.  The  projection  assumption:  the  assumption  is  that  it  is  possible  to 
determine  the  behaviour  of  the  system  on  the  real  input  from  the 
behaviour  on  the  test  suite.  The  test  suite  may  not  include  all  of  the 
phenomena  encountered  in  a  real  input. 

2.  Weighting  of  phenomena:  a  test  suite  does  not  indicate  the  weighting  of 
the  phenomena  according  to  what  one  would  expect  to  encounter  in  the 
real  input.  So  the  inability  to  handle  a  large  number  of  low-frequency
phenomena  will  lead  one  to  expect  a  worse  performance  than  if  there
is  a  problem  with  one  high-frequency  phenomenon,  even  though  the  latter
may  affect  more  of  the  real  input.

3.  It  is  necessary  to  take  source  and  target  languages  into  account.  For
example,  “John  went  into  the  house”  would  test  past  tense  and  location
prepositional  phrases  whereas  “John  entered  the  house”  would  test  past
tense  as  well  as  structural  divergences  in  the  case  of  Spanish. 
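
The weighting problem in item 2 can be made concrete with a small worked example in Python. The phenomena, their frequencies and the two systems are invented for illustration only; the point is simply that an unweighted pass rate and a frequency-weighted one can rank two systems in opposite orders.

```python
def pass_rates(results, frequency):
    """Return (unweighted, frequency-weighted) pass rates.

    results: phenomenon -> True if the system handles it, else False.
    frequency: phenomenon -> occurrences in a sample of real input.
    """
    unweighted = sum(results.values()) / len(results)
    total = sum(frequency[p] for p in results)
    handled = sum(frequency[p] for p, ok in results.items() if ok)
    return unweighted, handled / total


# Invented counts per 100 occurrences in a hypothetical corpus.
FREQ = {
    "simple declarative": 60,
    "relative clause": 10,
    "passive": 10,
    "conjoined clause": 10,
    "cleft": 10,
}

# System A fails the four low-frequency phenomena.
system_a = {p: p == "simple declarative" for p in FREQ}
# System B fails only the one high-frequency phenomenon.
system_b = {p: p != "simple declarative" for p in FREQ}

rates_a = pass_rates(system_a, FREQ)   # unweighted 1/5, weighted 60/100
rates_b = pass_rates(system_b, FREQ)   # unweighted 4/5, weighted 40/100
```

On the unweighted test-suite score System B looks far better, yet under these assumed frequencies System A would handle more of the real input.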

A  further  technique  used  in  evaluation  is  to  collect  the  output  of  the  system
and  evaluate  it  by  marking  and  categorizing  the  errors.  Another  technique 
for  evaluating  the  output  of  the  system  is  to  rate  the  output  according  to 
intelligibility  and  fidelity  scales.  In  both  cases  this  evaluation  is  done  by  hand 
and  tends  to  be  expensive,  tedious,  error-prone  and  subjective.  Marking  and 
categorizing  errors  requires  that  categories  of  errors  be  defined  beforehand
and  a  score  associated  with  each.  The  weighting  of  the  errors  tends  to  be 
subjective  unless  something  such  as  the  frequency  of  the  construction  in  the 
real  input  is  the  basis  of  the  scoring.  Likewise,  the  rating  of  intelligibility 
and  fidelity  requires  a  rating  scheme  and  will  be  subjective.  In  addition,  this 
type  of  rating  requires  many  test  inputs  and  evaluators  to  get  a  statistically 
significant  result.  In  both  cases  it  may  be  difficult  to  get  representative  test 
material.  However,  these  approaches  tend  to  be  the  most  reasonable  means
end-users  have  today  for  evaluating  MT  systems. 
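
Both hand-evaluation techniques reduce to simple arithmetic once the marking has been done. The Python sketch below assumes a hypothetical set of error categories with penalty weights fixed in advance, and a hypothetical five-point rating scale; neither is taken from the evaluations cited in this section.

```python
from collections import Counter
from statistics import mean, stdev

# Hypothetical error categories and penalty weights, defined before marking.
PENALTY = {"lexical": 1, "syntactic": 2, "meaning": 4}


def error_score(marked_errors, n_words):
    """Weighted error penalty per 100 words of output."""
    counts = Counter(marked_errors)
    total = sum(PENALTY[category] * n for category, n in counts.items())
    return 100 * total / n_words


def summarize_ratings(ratings):
    """Mean and sample standard deviation of the evaluators' ratings."""
    return mean(ratings), stdev(ratings)


# Errors marked by hand in a 200-word output (invented data).
errors = ["lexical", "lexical", "syntactic", "meaning"]
score = error_score(errors, n_words=200)   # (1 + 1 + 2 + 4) * 100 / 200 = 4.0

# Intelligibility ratings from four evaluators on an assumed 1-5 scale.
average, spread = summarize_ratings([4, 3, 5, 4])
```

The spread across evaluators gives a first indication of how subjective the ratings are; a large spread suggests that more evaluators or a more tightly defined scale are needed before the average can be trusted.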

Test-suite  evaluations  can  also  be  time-consuming,  tedious  and
error-prone.  A  number  of  tools  are  being  researched  to  help  with  test-suite
evaluation.  Shiwen  [193]  describes  a  tool  for  scoring  test-suite  results.  This  tool  provides
a  language  to  associate  input  strings  with  patterns  representing  acceptable 
outputs  and  scores.  Arnold  et  al.  [11]  describe  a  tool  for  test-suite
construction  which  uses  a  simple  grammar  to  generate  a  test-suite.  The  tool
described  by  Nerbonne  et  al.  [150]  records  in  a  relational  database  the
phenomena  tested  by  a  particular  construction  so  that  a  test-suite  can  be  built
from  this  database  by  indicating  the  grammatical  constructions  that  need 
to  be  tested. 
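
The idea of associating input strings with patterns for acceptable outputs and scores can be sketched as follows. This is not the language of the tool in [193], whose notation is not reproduced here; it is a hypothetical stand-in that uses Python regular expressions, and the patterns and scores are invented.

```python
import re

# Hypothetical scoring table: each source sentence maps to a list of
# (pattern for an acceptable output, score) pairs.
SCORING = {
    "John entered the house": [
        (r"John entró en la casa\.?", 1.0),
        (r"John entró a la casa\.?", 0.8),   # assumed partial credit
    ],
}


def score_output(source, output):
    """Return the best score among patterns that match the whole output."""
    best = 0.0
    for pattern, score in SCORING.get(source, []):
        if re.fullmatch(pattern, output.strip()):
            best = max(best, score)
    return best


full = score_output("John entered the house", "John entró en la casa.")
partial = score_output("John entered the house", "John entró a la casa")
none = score_output("John entered the house", "John entró la casa")
```

Matching against patterns rather than a single reference string lets one test item accept several translations, which removes some of the hand-checking that makes test-suite evaluation tedious.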